Hugging Face Demo: https://huggingface.co/spaces/aiola/whisper-ner-v1
Pretty good article that focuses on the privacy/security aspect of this — having a single model that does ASR and NER:
https://venturebeat.com/ai/aiola-unveils-open-source-ai-audi...
What advantage does this offer?
And in all the research we did, the best solutions ended up passing through a workflow of 1. NN-based NER, 2. regex, and 3. dictionary lookups to really clean the information (rough sketch below). Using a single method worked well in customer demos but always ran into what we had assumed were edge cases in prod.
That being said, the latency argument makes sense. This might work great in conversational use cases: picking out intent and responding. Every millisecond helps in making things sound natural.
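For what it's worth, that layered cleanup looks roughly like this (a toy sketch, not our actual code; the model step is stubbed out and the dictionary entries are made up):

```python
import re

# Layered entity extraction: 1) model-based NER, 2) regex for well-structured
# spans, 3) dictionary lookup to normalize what survives. Everything here is
# illustrative placeholder data.

KNOWN_PROVIDERS = {"acme clinic": "Acme Clinic", "dr smith": "Dr. Smith"}  # toy dictionary
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")

def nn_ner(text):
    # Placeholder for the neural NER step (e.g. a spaCy or transformers model).
    # Returns a hardcoded guess so the sketch runs end to end.
    return [{"text": "acme clinic", "label": "ORG"}]

def extract_entities(text):
    entities = nn_ner(text)                                   # step 1: NN-based NER
    entities += [{"text": m.group(), "label": "DATE"}         # step 2: regex for structured spans
                 for m in DATE_RE.finditer(text)]
    for ent in entities:                                      # step 3: dictionary normalization
        canonical = KNOWN_PROVIDERS.get(ent["text"].lower())
        if canonical:
            ent["text"] = canonical
            ent["canonical"] = True
    return entities

print(extract_entities("Booked acme clinic for 12/03/2024"))
```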
So if the first step isn't near-perfect (which ASR isn't), and if there is information or "world knowledge" in the later step(s) that helps resolve the ambiguity (which is true of knowledge about named entities and ASR), then you can get better accuracy from an end-to-end system where you don't commit to a single best option at the system boundary. Joint training can also help, but that IMHO is less important.
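A toy way to picture "don't pick just one option at the boundary": keep an n-best list from ASR and let entity knowledge rescore it (all names and scores below are invented):

```python
# Instead of committing to the single best ASR hypothesis, keep an n-best list
# and let downstream entity knowledge break ties.

GAZETTEER = {"aiola", "whisper", "gliner"}  # the "world knowledge" about entities

def rescore(hypotheses, entity_bonus=2.0):
    """hypotheses: list of (text, asr_log_score). Returns the best rescored text."""
    def score(hyp):
        text, asr_score = hyp
        hits = sum(1 for w in text.lower().split() if w in GAZETTEER)
        return asr_score + entity_bonus * hits
    return max(hypotheses, key=score)[0]

nbest = [
    ("call a yola support", -4.1),   # ASR's acoustic favourite, but mangles the name
    ("call aiola support", -4.3),    # slightly worse acoustically, right entity
]
print(rescore(nbest))  # -> "call aiola support"
```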
These ASR errors cascade into the NER step, further degrading recall and precision. Combining ASR and NER into a joint model or integrated approach can reduce these issues in theory; it's just more complex to implement and less commonly used.
Impressive, very impressive. I wonder if it could listen for credit cards or passwords.
"Say 'service' for customer service"
"service"
[TAG: complainer]
<click>
I'm building an assistant that gives information on local medical providers that match your criteria. I'm struggling with query expansion and entity recognition. For any incoming query, I want to run NER for medical terms (which are limited in scope and pre-determined), and subsequently do query rewriting and expansion.
This of course means that we now have to think about all the irreconcilable problems of taxonomy, but I'll take that any day over the old version :)
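Since the medical vocabulary is closed and pre-determined, something as simple as dictionary matching plus synonym expansion may get surprisingly far. A minimal sketch (all terms and synonyms here are just examples):

```python
# Closed-vocabulary "NER" plus query expansion: match terms from a curated
# dictionary, then rewrite the query with known synonyms.

MEDICAL_TERMS = {
    "cardiologist": ["heart doctor", "heart specialist"],
    "dermatologist": ["skin doctor"],
    "physical therapy": ["physiotherapy", "pt"],
}

def tag_and_expand(query):
    q = query.lower()
    matched = [term for term in MEDICAL_TERMS if term in q]   # dictionary-based entity match
    rewrites = [q.replace(term, syn)                          # synonym-substituted variants
                for term in matched for syn in MEDICAL_TERMS[term]]
    return matched, [query] + rewrites

terms, rewritten = tag_and_expand("Find a cardiologist near downtown")
print(terms)      # ['cardiologist']
print(rewritten)  # original query plus synonym-expanded variants
```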
There are open source NER models that can identify any specified entity type (https://universal-ner.github.io/, https://github.com/urchade/GLiNER). I don't see why this WhisperNER approach would be any better than doing ASR with whisper and then applying one of these NER models.
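For concreteness, that two-stage baseline looks something like this, assuming the published APIs of the openai-whisper and gliner packages (the audio file and label list are placeholders):

```python
# Two-stage baseline: transcribe with Whisper, then run an open-vocabulary
# NER model over the transcript.

import whisper
from gliner import GLiNER

asr = whisper.load_model("base")
text = asr.transcribe("meeting.wav")["text"]          # step 1: ASR

ner = GLiNER.from_pretrained("urchade/gliner_base")
labels = ["person", "organization", "medication"]     # arbitrary, user-specified types
for ent in ner.predict_entities(text, labels):        # step 2: open-type NER on the transcript
    print(ent["text"], "->", ent["label"])
```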