 

Why is speech recognition difficult? [closed]

Why is speech recognition so difficult? What are the specific challenges involved? I've read through a question on speech recognition, which did partially answer some of my questions, but the answers were largely anecdotal rather than technical. It also still didn't really answer why we still can't just throw more hardware at the problem.

I've seen tools that perform automated noise reduction using neural nets and ambient FFT analysis with excellent results, so I can't see a reason why we're still struggling with noise except in difficult scenarios like ludicrously loud background noise or multiple speech sources.
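The kind of FFT-based noise reduction mentioned above can be sketched as spectral gating: estimate a per-frequency-bin noise floor from a noise-only recording, then zero out bins in the speech signal that fall below it. This is a minimal illustration using NumPy; the frame length, hop size, and threshold factor are illustrative assumptions, not any particular tool's parameters.

```python
import numpy as np

def spectral_gate(signal, noise_sample, frame_len=512, hop=256, factor=1.5):
    """Zero out FFT bins whose magnitude falls below a noise-derived threshold."""
    window = np.hanning(frame_len)

    def frames(x):
        return [x[i:i + frame_len] * window
                for i in range(0, len(x) - frame_len + 1, hop)]

    # Per-bin noise floor estimated from a noise-only recording.
    noise_mag = np.mean([np.abs(np.fft.rfft(f)) for f in frames(noise_sample)],
                        axis=0)
    threshold = factor * noise_mag

    out = np.zeros(len(signal))
    for i in range(0, len(signal) - frame_len + 1, hop):
        spec = np.fft.rfft(signal[i:i + frame_len] * window)
        spec[np.abs(spec) < threshold] = 0.0        # gate bins below the floor
        out[i:i + frame_len] += np.fft.irfft(spec)  # overlap-add resynthesis
    return out
```

This works well against steady background noise (hum, hiss), which is exactly why the remaining hard cases are the ones you name: very loud or non-stationary noise, and competing speech, whose spectrum overlaps the speech you want to keep.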

Beyond this, isn't it just a case of using very large, complex, well-trained neural nets to do the processing, then throwing hardware at it to make it work fast enough?

I understand that strong accents are a problem and that we all have our colloquialisms, but these recognition engines still get basic things wrong when the person is speaking in a slow and clear American or British accent.

So, what's the deal? What technical problems are there that make it still so difficult for a computer to understand me?

Polynomial asked Dec 05 '11 07:12


2 Answers

Some technical reasons:

  • You need lots of tagged training data, which can be difficult to acquire once you take into account all the different accents, sounds etc.
  • Neural networks and similar gradient descent algorithms don't scale that well - just making them bigger (more layers, more nodes, more connections) doesn't guarantee that they will learn to solve your problem in a reasonable time. Scaling up machine learning to solve complex tasks is still a hard, unsolved problem.
  • Many machine learning approaches require normalised data (e.g. a defined start point, a standard pitch, a standard speed). They don't work well once you move outside these parameters. There are techniques such as convolutional neural networks etc. to tackle these problems, but they all add complexity and require a lot of expert fine-tuning.
  • Data size for speech can be quite large - the size of the data makes the engineering problems and computational requirements a little more challenging.
  • Speech data usually needs to be interpreted in context for full understanding - the human brain is remarkably good at "filling in the blanks" based on understood context. Missing information and ambiguous interpretations are filled in with the help of other modalities (such as vision). Current algorithms don't "understand" context, so they can't use it to help interpret the speech data. This is particularly problematic because many sounds and words are ambiguous unless taken in context.
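The data-size point is easy to make concrete with a back-of-the-envelope calculation for raw, uncompressed 16-bit PCM audio at a common ASR sampling rate (the figures are illustrative):

```python
# Raw audio size for uncompressed 16-bit PCM at a 16 kHz sampling rate.
sample_rate = 16_000      # samples per second, a common ASR rate
bytes_per_sample = 2      # 16-bit PCM
seconds_per_hour = 3600

bytes_per_hour = sample_rate * bytes_per_sample * seconds_per_hour
print(bytes_per_hour)  # 115200000 bytes, i.e. ~115 MB per hour of audio
```

A training corpus of a thousand hours is therefore on the order of 100 GB of raw audio before any feature extraction or augmentation, which is why the engineering side is non-trivial.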

Overall, speech recognition is a complex task. Not unsolvably hard, but hard enough that you shouldn't expect any sudden miracles, and it will certainly keep many researchers busy for many more years.

mikera answered Nov 16 '22 04:11


Humans use more than their ears when listening; they also use the knowledge they have about the speaker and the subject. Words are not sequenced together arbitrarily: there is grammatical structure and redundancy that humans use to predict words not yet spoken. Furthermore, idioms and how we "usually" say things make prediction even easier.

In speech recognition we only have the speech signal. We can of course construct a model of the grammatical structure and use some kind of statistical model to improve prediction, but there is still the problem of how to model world knowledge: the knowledge of the speaker and encyclopedic knowledge. We cannot, of course, model world knowledge exhaustively, but an interesting question is how much of it an ASR system actually needs to match human comprehension.
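The statistical-model idea can be sketched with a toy bigram language model. The corpus below is made up for illustration; the point is that two acoustically near-identical candidate transcriptions get very different scores once word-sequence statistics are taken into account.

```python
from collections import Counter

# Tiny illustrative corpus (invented for this sketch).
corpus = ("we want to recognize speech you want to recognize speech "
          "they want to recognize speech").split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def score(words):
    """Product of add-one-smoothed bigram probabilities."""
    p = 1.0
    for a, b in zip(words, words[1:]):
        p *= (bigrams[(a, b)] + 1) / (unigrams[a] + len(unigrams))
    return p

# The classic near-homophones: acoustics alone can barely separate
# "recognize speech" from "wreck a nice beach", but the language
# model strongly prefers the sequence it has actually seen.
likely = score("want to recognize speech".split())
unlikely = score("want to wreck a nice beach".split())
```

Real systems use the same principle at much larger scale (n-gram or neural language models), but the limitation the answer describes remains: frequency statistics are not world knowledge, and they cannot tell you which reading the speaker actually meant.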

Speech is uttered in an environment of sounds: a clock ticking, a computer humming, a radio playing somewhere down the corridor, another human speaker in the background, etc. This is usually called noise, i.e., unwanted information in the speech signal, which a recognizer has to identify and filter out. On top of that, spoken language != written language. The differences include:

1: Continuous speech

2: Channel variability

3: Speaker variability

4: Speaking style

5: Speed of speech

6: Ambiguity

All these points have to be considered while building a speech recognition system, and that's why it's quite difficult.

Referred from http://www.speech.kth.se/~rolf/gslt_papers/MarkusForsberg.pdf

FosterZ answered Nov 16 '22 04:11