Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating ARPA language model file with 50,000 words

I want to create an ARPA language model file with nearly 50,000 words. I can't generate the language model by passing my text file to the CMU Language Tool. Is any other link available where I can get a language model for these many words?

like image 622
Vipin Avatar asked Apr 21 '11 11:04

Vipin


1 Answers

I thought I'd answer this one since it has a few votes, although based on Christina's other questions I don't think this will be a usable answer for her since a 50,000-word language model almost certainly won't have an acceptable word error rate or recognition speed (or most likely even function for long) with in-app recognition systems for iOS that use this format of language model currently, due to hardware constraints. I figured it was worth documenting it because I think it may be helpful to others who are using a platform where keeping a vocabulary this size in memory is more of a viable thing, and maybe it will be a possibility for future device models as well.

There is no web-based tool I'm aware of like the Sphinx Knowledge Base Tool that will munge a 50,000-word plaintext corpus and return an ARPA language model. But, you can obtain an already-complete 64,000-word DMP language model (which can be used with Sphinx at the command line or in other platform implementations in the same way as an ARPA .lm file) with the following steps:

  1. Download this language model from the CMU speech site:

http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English%20HUB4%20Language%20Model/HUB4_trigram_lm.zip

In that folder is a file called language_model.arpaformat.DMP which will be your language model.

  1. Download this file from the CMU speech site, which will become your pronunciation dictionary:

https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/pocketsphinx/model/lm/en_US/cmu07a.dic

Convert the contents of cmu07a.dic to all uppercase letters.

If you want, you could also trim down the pronunciation dictionary by removing any words from it which aren't found in the corpus language_model.vocabulary (this would be a regex problem). These files are intended for use with one of the Sphinx English-language acoustic models.

If the desire to use a 50,000-word English language model is driven by the idea of doing some kind of generalized large vocabulary speech recognition and not by the need to use a very specific 50,000 words (for instance, something specialized like a medical dictionary or 50,000-entry contact list), this approach should give those results if the hardware can handle it. There are probably going to be some Sphinx or Pocketsphinx settings that will need to be changed which will optimize searches through this size of model.

like image 111
Halle Avatar answered Sep 22 '22 01:09

Halle