Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get started with speech-to-text?

I'm really interested in speech-to-text algorithms, but I'm not sure where to start studying up on them. A bunch of searching around led me to this, but it's from 1996 and I'm fairly certain that there have been improvements since then.

Does anyone who has any experience with this sort of stuff have any recommendations for reading / source code to examine? Or just general advice on what I should be trying to learn about if I want to get into the world of writing speech recognition programs (sometimes it's hard to know what to search for if you don't have much knowledge about the domain).

Edit: I'd like to do something cross-platform, but for the moment I'd be targeting linux.

Edit 2: Thanks csmba for the well-thought out reply. At this point in time, I'm mainly interested in being able to create applications that allow automation, or execution of different commands through voice. So, a limited amount of recognizable commands being able to be strung together. An example would be a music player that took commands like "Play the album Hello Everything by Squarepusher", or an application launcher that allowed the user to create voice-shortcuts to launch specific apps.

I realize that it's a pretty giant problem, and that I have nowhere near the level of knowledge required right now to tackle implementing an entire recognition engine, although the techniques involved with doing so fascinate me, and it is something I'd like to work myself up to doing. In all likelihood, I'll probably end up picking up a book or two on the subject and studying up / playing with "simple" implementations in my free time.

like image 847
jdd Avatar asked Aug 18 '08 16:08

jdd


People also ask

How do I train Microsoft dictate?

If you want to retrain your computer to recognize your voice, press the Windows logo key, type Control Panel, and select Control Panel in the list of results. In Control Panel, select Ease of Access > Speech Recognition > Train your computer to better understand you.

How does speak to text work?

To use voice-to-text on Android, open the keyboard in your chat or text app and tap the smartphone icon. Start speaking, and the phone will recognize your message.


1 Answers

This is a HUGE questions, I wouldn't know how to begin... So let me just try giving you the right "terms" so you can refine your quest:

First, understand that Speech Recognition is a diverse and complicated subject, and it has many different applications. People tend to map this domain to the first thing that comes to their head (usually, that would be computers understanding what you are saying like in IVR systems). So first lets distinguise the concept into the main categories:

Human-to-Machine: Applications that deal with understanding what a human is saying, but the human knows he is talking to a machine and the grammar is very limited. Examples are

  • Computer automation
  • Specialized: Pilots automating some controls for example (noise a huge problem)
  • IVR (Interactive Voice Response) systems like Google-411 or when you call the bank and the computer on the other side says "say 'service' to get customer service"

human-to-human (Spontaneous speech): This is a bigger, more complex problem. Here we can also break it down into different applciations:

  • Call Center: conversation between Agent-Customer, phone quality, compressed
  • Intelligence: radio/phone/live conversations between 2 or more individuals

Now, Speech-To-Text is not what you should be saying that you care about. What you care about is solving a problem. Different technologies are used to solve different problems. See an overview here of some of them. to summarize, other approaches are Phonetic transcription, LVCSR and direct based.

Also, are you interested in being the PHd behind the technology? you would need a Masters equivalent involving Signal processing and probably a PHd to be cutting edge. In which case, you will work for a company that develops the actual speech engine. Companies like Nuance and IBM are the big ones, but also Phillips and other startups exist.

On the other hand, if you want to be the one implementing applications, you will not be working on the engine, but working on building application that USE the engine. A good analogy I think is form the gaming industry: Are you developing the graphic engine (like the Cry engine), or working on one of several hundred games, all use the same graphic engine?

Don't get me wrong, there is plenty to work on the quality of the search also outside the IBM/Nuance of the world. The engine is usually very open, and there are a lot of algorithmic tweaking to be done that can dramatically affect performance. Each business application has different constraints and cost/benefit function, so you can make experiments for many years building better voice recognition based applications.

one more thing: in general, you would also want to have good statistics background the lower in the stack you want to be.

At this point in time, I'm mainly interested in being able to create applications that allow automation

Good, we are converging here... Then you have no interest in "Speech-to-Text". That buzzwords takes you to the world of full transcription, a place you do not need to go to. You should be focusing on some of the more Human-to-Machine technologies like Voice XML and the ones used in IVR systems (Nuance is the biggest player there)

like image 108
csmba Avatar answered Sep 16 '22 12:09

csmba