How do I build a large-vocabulary language model for CMU Sphinx?

Question

I would like to build a language model for CMU Sphinx, but my corpus has more than 1000 words, so I cannot use the online tool. How do I build the language model myself, presumably with the scripts in cmuclmtk?

Nikolay Shmyrev · Accepted Answer

Please read the tutorial:

http://cmusphinx.sourceforge.net/wiki/tutoriallm
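For orientation, the cmuclmtk command sequence can be sketched from Python roughly as follows. This is only a sketch under assumptions: the four cmuclmtk programs (text2wfreq, wfreq2vocab, text2idngram, idngram2lm) are assumed to be installed and on your PATH, the corpus is assumed to be a plain text file with one sentence per line wrapped in <s> ... </s> markers, and the helper names and file names are hypothetical. Check the tutorial for the options that match your installed version.

```python
import subprocess
from pathlib import Path


def cmuclmtk_commands(stem):
    """Return the argument list for each cmuclmtk step.

    `stem` is a hypothetical base name for the generated files
    (stem.vocab, stem.idngram, stem.arpa).
    """
    return {
        # Reads text on stdin, writes word frequencies to stdout.
        "text2wfreq": ["text2wfreq"],
        # Reads word frequencies on stdin, writes a vocabulary to stdout.
        "wfreq2vocab": ["wfreq2vocab"],
        # Reads text on stdin, writes binary id n-grams to stem.idngram.
        "text2idngram": ["text2idngram",
                         "-vocab", f"{stem}.vocab",
                         "-idngram", f"{stem}.idngram"],
        # Turns the id n-grams into an ARPA-format language model.
        "idngram2lm": ["idngram2lm", "-vocab_type", "0",
                       "-idngram", f"{stem}.idngram",
                       "-vocab", f"{stem}.vocab",
                       "-arpa", f"{stem}.arpa"],
    }


def build_lm(corpus, stem):
    """Run the steps in order; raises CalledProcessError if a tool fails."""
    cmds = cmuclmtk_commands(stem)
    corpus = Path(corpus)

    # Step 1: corpus -> word frequencies -> vocabulary file.
    with corpus.open("rb") as src, open(f"{stem}.vocab", "wb") as vocab:
        wfreq = subprocess.run(cmds["text2wfreq"], stdin=src,
                               stdout=subprocess.PIPE, check=True)
        subprocess.run(cmds["wfreq2vocab"], input=wfreq.stdout,
                       stdout=vocab, check=True)

    # Step 2: corpus + vocabulary -> id n-gram file.
    with corpus.open("rb") as src:
        subprocess.run(cmds["text2idngram"], stdin=src, check=True)

    # Step 3: id n-grams -> ARPA language model.
    subprocess.run(cmds["idngram2lm"], check=True)
```

Calling build_lm("corpus.txt", "corpus") would, if the tools are present, leave corpus.arpa in the working directory. Keeping the command lists in a separate function makes it easy to print and inspect them before running anything.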

Tilo · Answer

This is not a trivial task: generating a language model is time- and resource-intensive.

If you want a "good" language model, you will need a large or very large text corpus to train it on (think on the order of several years of Wall Street Journal text).

"Good" here means that the language model is able to generalize from the training data to new, previously unseen input data.
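One common way to put a number on that kind of generalization is perplexity on held-out text: lower means the model is less "surprised" by the new data. The following is a minimal sketch, assuming a toy unigram model with add-one smoothing (not what Sphinx or cmuclmtk use internally); the function name and the sample sentences are made up for illustration.

```python
import math
from collections import Counter


def add_one_perplexity(train_tokens, heldout_tokens):
    """Perplexity of a unigram model with add-one smoothing.

    Simplification: the vocabulary is taken as the union of training
    and held-out words, so every held-out word gets a nonzero probability.
    """
    counts = Counter(train_tokens)
    total = len(train_tokens)
    vocab = set(train_tokens) | set(heldout_tokens)

    log_prob = 0.0
    for word in heldout_tokens:
        # Add-one smoothed probability; Counter returns 0 for unseen words.
        p = (counts[word] + 1) / (total + len(vocab))
        log_prob += math.log2(p)

    # Perplexity is 2 to the power of the average negative log2 probability.
    return 2 ** (-log_prob / len(heldout_tokens))


train = "open the door close the door open the window".split()
similar = "open the door".split()      # words the model has seen
different = "lock the garage".split()  # mostly unseen words

print(add_one_perplexity(train, similar))
print(add_one_perplexity(train, different))
```

The second value comes out higher than the first, which is the small-scale version of the problem: a model trained on too little text assigns low probability to anything it has not seen.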

You should look at the documentation for the Sphinx and HTK language model toolkits:

http://cmusphinx.sourceforge.net/wiki/tutoriallm

Also check these two threads:

Building openears compatible language model

Ruby Text Analysis

You could take a more general language model, trained on a bigger corpus, and interpolate your smaller language model with it, e.g. using a back-off language model, but that is not a trivial task either.

See: Katz's back-off model
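To make the interpolation idea concrete, here is a minimal sketch of plain linear interpolation between two unigram distributions. Note that this is the simplest form of mixing two models and is not Katz back-off; the function name, the weight lam, and the toy probabilities are all assumptions for illustration. Real toolkits interpolate full n-gram models and tune the weight on held-out data.

```python
def interpolate(p_domain, p_general, lam=0.5):
    """Linearly interpolate two word-probability dictionaries.

    p(w) = lam * p_domain(w) + (1 - lam) * p_general(w)
    A word missing from one model contributes probability 0 from that model.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be between 0 and 1")
    words = set(p_domain) | set(p_general)
    return {
        w: lam * p_domain.get(w, 0.0) + (1.0 - lam) * p_general.get(w, 0.0)
        for w in words
    }


# Toy distributions: a small in-domain model and a broader general model.
small = {"sphinx": 0.5, "model": 0.5}
big = {"the": 0.5, "model": 0.25, "sphinx": 0.25}

mixed = interpolate(small, big, lam=0.5)
print(sorted(mixed.items()))
```

Because both inputs sum to 1, the mixture does too, and a word such as "the" that the small model never saw still ends up with a nonzero probability.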

Tags:

speech-recognition

cmusphinx

joeforker