What's so difficult about the subject that algorithm designers are having a hard time tackling it?
Is it really that complex?
I'm having a hard time grasping why this topic is so problematic. Can anyone give me an example as to why this is the case?
ASR systems often fail to accurately process and understand human speech because of background noise, multiple people talking at once, signal disruption, and distance from the microphone.

There are limitations to speech recognition software. It does not always work across all operating systems. Noisy environments, accents, and multiple speakers may degrade results. Also, regular voice recognition software often lacks integration with other key services.
Key features of the Google Speech-to-Text API include high accuracy (roughly 80-85%) and broad transcription capabilities: it can transcribe both pre-recorded and real-time audio in 125+ languages and variants.
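For illustration, a pre-recorded transcription request with the official google-cloud-speech Python client might look like the rough sketch below. The file name audio.wav and the 16 kHz LINEAR16 settings are assumptions for the example, not anything required by the service; check the current API documentation before relying on it.

```python
# Rough sketch: transcribe a short pre-recorded file with Google Speech-to-Text.
# "audio.wav", LINEAR16, and 16 kHz are illustrative assumptions.
from google.cloud import speech

client = speech.SpeechClient()

with open("audio.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",  # one of the 125+ supported languages and variants
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    # Print the top transcription hypothesis for each recognized segment.
    print(result.alternatives[0].transcript)
```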
Auditory processing is a very complex task. Human evolution has produced a system so good that we don't realize how good it is. If three people are talking to you at the same time, you will be able to focus on one signal and discard the others, even if they are louder. Noise is discarded very well too. In fact, if you hear a human voice played backwards, the first stages of the auditory system will send this signal to a different processing area than if it were a real speech signal, because the system will regard it as "non-voice". This is an example of the outstanding abilities humans have.
Speech recognition advanced quickly from the 70s because researchers were studying the production of voice. This is a simpler system: vocal cords excited or not, resonance of the vocal tract... it is a mechanical system that is easy to understand. The main product of this approach is cepstral analysis, which allowed automatic speech recognition (ASR) to achieve acceptable results. But it is a sub-optimal approach. Noise separation is quite bad: even though it works more or less in clean environments, it is not going to work with loud music in the background the way humans can.
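For the curious, cepstral analysis itself only takes a few lines. Here is a minimal NumPy/SciPy sketch, assuming a hypothetical mono recording speech.wav and a 25 ms frame length chosen just for illustration: the real cepstrum of a frame is the inverse FFT of the log magnitude of its FFT.

```python
# Minimal sketch of cepstral analysis on one frame of speech.
# Assumptions: "speech.wav" is a mono PCM file; 25 ms frame length is arbitrary.
import numpy as np
from scipy.io import wavfile

rate, signal = wavfile.read("speech.wav")
frame = signal[: int(0.025 * rate)].astype(float)  # one 25 ms analysis frame
frame *= np.hamming(len(frame))                    # taper to reduce spectral leakage

spectrum = np.fft.rfft(frame)
log_magnitude = np.log(np.abs(spectrum) + 1e-10)   # small offset avoids log(0)
cepstrum = np.fft.irfft(log_magnitude)             # real cepstrum: IFFT of log |FFT|

# Low "quefrency" coefficients describe the vocal-tract envelope,
# while a peak at higher quefrency reveals the pitch period of the vocal cords.
print(cepstrum[:13])
```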
The optimal approach depends on understanding the auditory system: its first stages in the cochlea, the inferior colliculus... but the brain is also involved, and we do not know that much about it yet. It is proving to be a difficult paradigm shift.
In a paper, Professor Hynek Hermansky compared the current state of the research with the era when humans wanted to fly: we did not know what the secret was (the feathers? the flapping of the wings?) until we discovered Bernoulli's principle.