How to get started with speech-to-text?

Tags: language-agnostic, speech-recognition

I'm really interested in speech-to-text algorithms, but I'm not sure where to start studying up on them. A bunch of searching around led me to this, but it's from 1996 and I'm fairly certain that there have been improvements since then.

Does anyone who has any experience with this sort of stuff have any recommendations for reading / source code to examine? Or just general advice on what I should be trying to learn about if I want to get into the world of writing speech recognition programs (sometimes it's hard to know what to search for if you don't have much knowledge about the domain).

Edit: I'd like to do something cross-platform, but for the moment I'd be targeting Linux.

Edit 2: Thanks csmba for the well-thought-out reply. At this point in time, I'm mainly interested in being able to create applications that allow automation, or execution of different commands through voice. So, a limited number of recognizable commands that can be strung together. An example would be a music player that took commands like "Play the album Hello Everything by Squarepusher", or an application launcher that allowed the user to create voice shortcuts to launch specific apps.

I realize that it's a pretty giant problem, and that I have nowhere near the level of knowledge required right now to tackle implementing an entire recognition engine, although the techniques involved with doing so fascinate me, and it is something I'd like to work myself up to doing. In all likelihood, I'll probably end up picking up a book or two on the subject and studying up / playing with "simple" implementations in my free time.

847

asked Aug 18 '08 16:08

jdd

1 Answer

This is a HUGE question, and I wouldn't know where to begin... So let me just try giving you the right "terms" so you can refine your quest:

First, understand that speech recognition is a diverse and complicated subject with many different applications. People tend to map this domain to the first thing that comes to mind (usually computers understanding what you are saying, as in IVR systems). So first, let's divide the concept into its main categories:

Human-to-Machine: Applications that deal with understanding what a human is saying, where the human knows he is talking to a machine and the grammar is very limited. Examples are:

Computer automation
Specialized: pilots automating some controls, for example (noise is a huge problem)
IVR (Interactive Voice Response) systems, like Google-411, or when you call the bank and the computer on the other side says "say 'service' to get customer service"
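
To make "the grammar is very limited" concrete, here is a minimal sketch in Python of IVR-style routing. It assumes some recognizer has already turned the audio into a text string. The menu words and the `route` function are hypothetical names made up for illustration, not part of any real IVR product.

```python
from typing import Optional

# Hypothetical menu: the only words this system is prepared to understand.
MENU = {
    "service": "customer_service",
    "balance": "account_balance",
    "operator": "human_operator",
}


def route(transcript: str) -> Optional[str]:
    """Return the menu destination for the first known keyword, or None."""
    for word in transcript.lower().split():
        word = word.strip(".,!?'\"")  # drop punctuation stuck to the word
        if word in MENU:
            return MENU[word]
    return None  # out-of-grammar input: a real system would re-prompt


print(route("Service, please"))
print(route("I want to talk about the weather"))
```

Everything outside the menu is simply rejected, which is what makes this class of problem so much more tractable than open-ended transcription.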

Human-to-Human (spontaneous speech): This is a bigger, more complex problem. Here too we can break it down into different applications:

Call Center: conversation between Agent-Customer, phone quality, compressed
Intelligence: radio/phone/live conversations between 2 or more individuals

Now, "Speech-to-Text" is not what you should be saying that you care about. What you care about is solving a problem, and different technologies are used to solve different problems. See an overview here of some of them. To summarize, other approaches are phonetic transcription, LVCSR (large-vocabulary continuous speech recognition) and direct based.

Also, are you interested in being the PhD behind the technology? You would need a Master's equivalent involving signal processing, and probably a PhD, to be cutting edge. In that case, you would work for a company that develops the actual speech engine. Nuance and IBM are the big ones, but Philips and a number of startups exist as well.

On the other hand, if you want to be the one implementing applications, you will not be working on the engine, but on building applications that USE the engine. A good analogy, I think, is from the gaming industry: are you developing the graphics engine (like CryEngine), or working on one of several hundred games that all use the same graphics engine?

Don't get me wrong, there is plenty of work to do on the quality of the search outside the IBMs and Nuances of the world, too. The engine is usually very open, and there is a lot of algorithmic tweaking to be done that can dramatically affect performance. Each business application has different constraints and a different cost/benefit function, so you can experiment for many years building better voice-recognition-based applications.

One more thing: in general, the lower in the stack you want to be, the stronger a statistics background you will want.

At this point in time, I'm mainly interested in being able to create applications that allow automation

Good, we are converging here... Then you have no interest in "Speech-to-Text". That buzzword takes you to the world of full transcription, a place you do not need to go. You should be focusing on some of the more Human-to-Machine technologies, like VoiceXML and the ones used in IVR systems (Nuance is the biggest player there).
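
As a rough illustration of what a limited command grammar looks like from the application side, here is a minimal Python sketch. It assumes a recognition engine has already produced a text transcript, and it uses plain regular expressions rather than VoiceXML or any real engine's grammar format. The command names, the patterns, and `parse_command` are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical grammar: each pattern is one command the application accepts.
COMMANDS = [
    ("play_album",
     re.compile(r"^play the album (?P<album>.+) by (?P<artist>.+)$", re.I)),
    ("launch_app",
     re.compile(r"^(?:launch|open) (?P<app>.+)$", re.I)),
]


def parse_command(transcript: str) -> Optional[dict]:
    """Match a transcript against the grammar and extract its slots."""
    text = " ".join(transcript.split())  # trim and collapse whitespace
    for name, pattern in COMMANDS:
        match = pattern.match(text)
        if match:
            return {"command": name, **match.groupdict()}
    return None  # not a command this application knows


print(parse_command("Play the album Hello Everything by Squarepusher"))
print(parse_command("open terminal"))
print(parse_command("what is the meaning of life"))
```

The point of the sketch is the shape of the problem: a handful of fixed phrases with a few open slots (album, artist, app name), not free-form transcription. A real engine would typically be handed a grammar like this up front so it only has to choose among the allowed phrases.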

108

answered Sep 16 '22 12:09

csmba
