
ARPA language model documentation

Where can I find documentation on ARPA language model format?

I am developing a simple speech recognition app with the PocketSphinx STT engine. An ARPA language model is recommended there for performance reasons. I want to understand how much I can do to adjust my language model for my custom needs.

All I have found so far are some very brief descriptions of the ARPA format:

  • http://kered.org/blog/2008-08-12/arpa-language-model-file-format/
  • http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html
  • http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html

I am a beginner with STT and I have trouble wrapping my head around this (n-grams, etc.). I am looking for more detailed docs, something like the documentation on JSGF grammars here:

http://www.w3.org/TR/jsgf/

Lukasz asked May 06 '13 22:05




1 Answer

There is actually not much more to say about the format than what those docs already cover.

Besides, you'll probably want to prepare a text file with sample sentences and generate the language model file from it. There is an online tool that can do it for you: lmtool
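To make the format from the linked docs a bit more concrete, here is a minimal sketch of a reader for it. It assumes the usual layout: a `\data\` header with `ngram N=count` lines, one `\N-grams:` section per order, entries of the form `log10-probability words [backoff-weight]`, and a closing `\end\`. The names `parse_arpa` and `SAMPLE_ARPA` are hypothetical, and the numbers in the sample are made up for illustration, not produced by any real toolkit.

```python
# Tiny hand-written bigram model in ARPA layout (illustrative values only).
SAMPLE_ARPA = r"""\data\
ngram 1=4
ngram 2=3

\1-grams:
-0.6990 </s>
-99.0000 <s> -0.3010
-0.3979 hello -0.3010
-0.6990 world -0.3010

\2-grams:
-0.3010 <s> hello
-0.3010 hello world
-0.3010 world </s>

\end\
"""


def parse_arpa(text):
    """Return (counts, ngrams) read from ARPA-formatted text.

    counts maps n-gram order -> count declared in the header.
    ngrams maps order -> {word tuple: (log10 probability, backoff or None)}.
    """
    counts = {}
    ngrams = {}
    order = None  # order of the section currently being read
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "\\data\\":
            continue
        if line == "\\end\\":
            break
        if order is None and line.startswith("ngram "):
            n, count = line[len("ngram "):].split("=")
            counts[int(n)] = int(count)
            continue
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:-len("-grams:")])
            ngrams[order] = {}
            continue
        if order is None:
            continue  # ignore anything unexpected before the first section
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The backoff weight is optional and only follows the words.
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
        ngrams[order][words] = (logprob, backoff)
    return counts, ngrams


counts, ngrams = parse_arpa(SAMPLE_ARPA)
print(counts)
print(ngrams[2][("hello", "world")])
```

Reading a file generated by a tool like lmtool this way is a quick method for checking which words and word sequences ended up in the model before handing it to the recognizer.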

Dariusz answered Oct 05 '22 04:10