Why is speech recognition difficult? [closed]

Why is speech recognition so difficult? What are the specific challenges involved? I've read through a question on speech recognition, which partially answered some of my questions, but the answers were largely anecdotal rather than technical. It also didn't really explain why we still can't just throw more hardware at the problem.

I've seen tools that perform automated noise reduction using neural nets and ambient FFT analysis with excellent results, so I can't see a reason why we're still struggling with noise except in difficult scenarios like ludicrously loud background noise or multiple speech sources.

Beyond this, isn't it just a case of using very large, complex, well-trained neural nets to do the processing, then throwing hardware at it to make it work fast enough?

I understand that strong accents are a problem and that we all have our colloquialisms, but these recognition engines still get basic things wrong when the person is speaking in a slow and clear American or British accent.

So, what's the deal? What technical problems are there that make it still so difficult for a computer to understand me?

578

asked Dec 05 '11 07:12

Polynomial

2 Answers

Some technical reasons:

You need lots of tagged training data, which can be difficult to acquire once you take into account all the different accents, sounds etc.
Neural networks and similar gradient descent algorithms don't scale that well - just making them bigger (more layers, more nodes, more connections) doesn't guarantee that they will learn to solve your problem in a reasonable time. Scaling up machine learning to solve complex tasks is still a hard, unsolved problem.
Many machine learning approaches require normalised data (e.g. a defined start point, a standard pitch, a standard speed). They don't work well once you move outside these parameters. There are techniques such as convolutional neural networks etc. to tackle these problems, but they all add complexity and require a lot of expert fine-tuning.
Data size for speech can be quite large - the size of the data makes the engineering problems and computational requirements a little more challenging.
Speech data usually needs to be interpreted in context for full understanding - the human brain is remarkably good at "filling in the blanks" based on understood context. Missing information and competing interpretations are resolved with the help of other modalities (like vision). Current algorithms don't "understand" context, so they can't use it to help interpret the speech data. This is particularly problematic because many sounds / words are ambiguous unless taken in context.
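To make the last point a bit more concrete, here is a minimal sketch of how a recogniser might use a statistical language model to choose between two transcriptions that sound almost the same. Everything in it is an assumption for illustration: the tiny corpus is made up, the names `train_bigrams`, `log_prob` and `pick_hypothesis` are hypothetical, and a real system would combine this score with an acoustic score and train on far more text.

```python
import math
from collections import Counter

# Toy training text, invented for this example only.
CORPUS = [
    "it is hard to recognize speech",
    "computers recognize speech poorly in noise",
    "we walked on a nice beach",
    "the storm could wreck a boat",
]

def train_bigrams(sentences):
    """Count single words and adjacent word pairs, with a start marker."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a word sequence."""
    vocab_size = len(unigrams)
    words = ["<s>"] + sentence.split()
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total

def pick_hypothesis(hypotheses, unigrams, bigrams):
    """Return the hypothesis the language model considers most likely."""
    return max(hypotheses, key=lambda h: log_prob(h, unigrams, bigrams))

unigrams, bigrams = train_bigrams(CORPUS)
candidates = [
    "it is hard to recognize speech",
    "it is hard to wreck a nice beach",
]
best = pick_hypothesis(candidates, unigrams, bigrams)
print(best)
```

The model only knows which words tend to follow each other. It has no idea what the speaker is talking about, which is exactly the kind of context that is hard to capture.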

Overall, speech recognition is a complex task. Not unsolvably hard, but hard enough that you shouldn't expect any sudden miracles, and it will certainly keep many researchers busy for many more years.

106

answered Nov 16 '22 04:11

mikera

Humans use more than their ears when listening; they also use the knowledge they have about the speaker and the subject. Words are not arbitrarily sequenced together: there is a grammatical structure and redundancy that humans use to predict words not yet spoken. Furthermore, idioms and how we 'usually' say things make prediction even easier.

In speech recognition we only have the speech signal. We can of course construct a model for the grammatical structure and use some kind of statistical model to improve prediction, but there is still the problem of how to model world knowledge, knowledge of the speaker and encyclopedic knowledge. We cannot, of course, model world knowledge exhaustively, but an interesting question is how much of it an ASR system actually needs to measure up to human comprehension.

Speech is uttered in an environment of sounds: a clock ticking, a computer humming, a radio playing somewhere down the corridor, another human speaker in the background etc. This is usually called noise, i.e., unwanted information in the speech signal. In speech recognition we have to identify and filter out these noises from the speech signal.

Spoken language != Written language

1: Continuous speech

2: Channel variability

3: Speaker variability

4: Speaking style

5: Speed of speech

6: Ambiguity

All these points have to be considered when building a speech recognition system, which is why it is quite difficult.
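As a small illustration of point 5 (speed of speech), here is a sketch using dynamic time warping (DTW), a classic technique for comparing sequences spoken at different speeds. It is not taken from the referenced paper, and the "feature" sequences are made-up single numbers per frame; real systems work on multi-dimensional acoustic features. The function names `naive_distance` and `dtw_distance` are hypothetical.

```python
def naive_distance(a, b):
    """Compare frame i with frame i, ignoring anything past the shorter input."""
    return sum(abs(x - y) for x, y in zip(a, b))

def dtw_distance(a, b):
    """Dynamic-time-warping cost between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i frames of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # frame of a stretched over several frames of b
                cost[i][j - 1],      # frame of b stretched over several frames of a
                cost[i - 1][j - 1],  # one-to-one step
            )
    return cost[n][m]

template = [0, 1, 3, 1, 0]                  # a "word" at normal speed
slow = [0, 0, 1, 1, 3, 3, 1, 1, 0, 0]       # the same word at half speed
other = [3, 1, 0, 1, 3]                     # a different word

print(naive_distance(template, slow))   # frame-by-frame comparison sees a mismatch
print(dtw_distance(template, slow))     # DTW aligns the two perfectly
print(dtw_distance(template, other))    # a genuinely different word still costs something
```

DTW handles a uniform change of speed like this one, but it costs O(n*m) per comparison and does nothing about the other points in the list, such as speaker or channel variability.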

Referenced from http://www.speech.kth.se/~rolf/gslt_papers/MarkusForsberg.pdf

answered Nov 16 '22 04:11

FosterZ

Tags: algorithm, theory, speech-recognition
