Why is speech recognition so difficult? What are the specific challenges involved? I've read through a question on speech recognition, which did partially answer some of my questions, but the answers were largely anecdotal rather than technical. It also still didn't really answer why we still can't just throw more hardware at the problem.
I've seen tools that perform automated noise reduction using neural nets and ambient FFT analysis with excellent results, so I can't see a reason why we're still struggling with noise except in difficult scenarios like ludicrously loud background noise or multiple speech sources.
Beyond this, isn't it just a case of using very large, complex, well-trained neural nets to do the processing, then throwing hardware at it to make it work fast enough?
I understand that strong accents are a problem and that we all have our colloquialisms, but these recognition engines still get basic things wrong when the person is speaking in a slow and clear American or British accent.
So, what's the deal? What technical problems are there that make it still so difficult for a computer to understand me?
Imprecision and false interpretations. Speech recognition software isn't always able to interpret spoken words correctly. This is due to computers not being on par with humans in understanding the contextual relation of words and sentences, causing misinterpretations of what the speaker meant to say or achieve.
Speech recognition is hard because listening is harder and more complicated than we naively think. Let's look at what we do and what a machine would need to do: We have the physiology and anatomy to accept an acoustic wave.
Some technical reasons:
Overall, speech recognition is a complex task. It is not unsolvably hard, but it is hard enough that you shouldn't expect any sudden miracles, and it will certainly keep many researchers busy for many more years.
Humans use more than their ears when listening: they also use what they know about the speaker and the subject. Words are not sequenced arbitrarily; there is grammatical structure and redundancy that humans use to predict words not yet spoken. Furthermore, idioms and the way we "usually" say things make prediction even easier.
In speech recognition we only have the speech signal. We can of course construct a model of the grammatical structure and use some kind of statistical model to improve prediction, but there is still the problem of how to model world knowledge, knowledge of the speaker, and encyclopedic knowledge. We cannot model world knowledge exhaustively, but an interesting question is how much of it an ASR system actually needs in order to measure up to human comprehension.
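To make the "statistical model to improve prediction" idea concrete, here is a minimal sketch of a bigram model, which estimates how likely a word is given the word before it. The tiny corpus and the function names (`train_bigrams`, `next_word_probability`) are made up for illustration; a real recognizer would train on far more text and would smooth the counts rather than assign zero to unseen pairs.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows another in the training sentences."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # "<s>" is a hypothetical start-of-sentence marker.
        words = ["<s>"] + sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probability(counts, prev, candidate):
    """Relative-frequency estimate of P(candidate | prev); 0.0 if prev is unseen."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return followers[candidate] / total if total else 0.0


# Toy corpus, invented for this example only.
corpus = [
    "it is hard to recognize speech",
    "machines recognize speech poorly in noise",
    "we walked along a nice beach",
]
model = train_bigrams(corpus)

# Two acoustically similar continuations get very different scores.
print(next_word_probability(model, "recognize", "speech"))
print(next_word_probability(model, "recognize", "beach"))
```

Even this crude model prefers "recognize speech" over "recognize beach", which is the kind of prediction a human listener makes without thinking. What it cannot capture is the world knowledge and speaker knowledge mentioned above.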
Speech is uttered in an environment of sounds: a clock ticking, a computer humming, a radio playing somewhere down the corridor, another human speaker in the background, and so on. This is usually called noise, i.e., unwanted information in the speech signal. In speech recognition we have to identify these noises and filter them out of the speech signal.
Spoken language != Written language
1: Continuous speech
2: Channel variability
3: Speaker variability
4: Speaking style
5: Speed of speech
6: Ambiguity
All of these points have to be considered when building a speech recognition system. That's why it is quite difficult.
------------- Referenced from http://www.speech.kth.se/~rolf/gslt_papers/MarkusForsberg.pdf
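As an illustration of point 5 in the list above (speed of speech), here is a minimal sketch of dynamic time warping (DTW), a classic technique for comparing two sequences that unfold at different speeds. This is my own toy example, not taken from the referenced paper: the sequences are invented one-dimensional values standing in for acoustic features, and `dtw_distance` is a hypothetical helper name. Real systems compare multi-dimensional feature vectors and generally rely on statistical or neural models rather than plain DTW.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i items of a with the first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Invented toy "feature" sequences for illustration.
normal = [1.0, 3.0, 4.0, 2.0]                      # a word at normal speed
slow = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 2.0, 2.0]    # the same word, spoken at half speed
other = [4.0, 1.0, 1.0, 3.0]                       # a different word

print(dtw_distance(normal, slow))   # the slow version aligns perfectly
print(dtw_distance(normal, other))  # the different word does not
```

A naive sample-by-sample comparison would fail here because the two versions of the word have different lengths. DTW absorbs the timing difference, but it does nothing about the other items on the list, such as speaker variability or ambiguity.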