How to tune a machine translation model with a huge language model?

Moses is a toolkit for building machine translation models, and KenLM is the de facto language model toolkit that Moses uses.

I have a text file with 16 GB of text and I use it to build a language model like this:

bin/lmplz -o 5 < text > text.arpa

The resulting file (text.arpa) is 38 GB. Then I binarized the language model like this:

bin/build_binary text.arpa text.binary

The binarized language model (text.binary) grows to 71 GB.
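
While reading the KenLM docs I also noticed that build_binary supports a trie data structure with quantization, which should give a much smaller binary than the default probing structure (the -q/-b bit widths here are illustrative values, not something I have tuned):

bin/build_binary -q 8 -b 8 trie text.arpa text.binary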

In Moses, after training the translation model, you tune the model weights using the MERT algorithm. This can be done with https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/mert-moses.pl.

MERT works fine with a small language model, but with the big language model it takes several days to finish.
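
For context, my tuning invocation is roughly the following (the paths are illustrative for my setup):

scripts/training/mert-moses.pl dev.src dev.ref bin/moses model/moses.ini --mertdir bin/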

A Google search turned up KenLM's filter, which promises to filter the language model down to a smaller size: https://kheafield.com/code/kenlm/filter/

But I'm clueless about how to make it work. The command's help output is:

$ ~/moses/bin/filter
Usage: /home/alvas/moses/bin/filter mode [context] [phrase] [raw|arpa] [threads:m] [batch_size:m] (vocab|model):input_file output_file

copy mode just copies, but makes the format nicer for e.g. irstlm's broken
    parser.
single mode treats the entire input as a single sentence.
multiple mode filters to multiple sentences in parallel.  Each sentence is on
    a separate line.  A separate file is created for each sentence by appending
    the 0-indexed line number to the output file name.
union mode produces one filtered model that is the union of models created by
    multiple mode.

context means only the context (all but last word) has to pass the filter, but
    the entire n-gram is output.

phrase means that the vocabulary is actually tab-delimited phrases and that the
    phrases can generate the n-gram when assembled in arbitrary order and
    clipped.  Currently works with multiple or union mode.

The file format is set by [raw|arpa] with default arpa:
raw means space-separated tokens, optionally followed by a tab and arbitrary
    text.  This is useful for ngram count files.
arpa means the ARPA file format for n-gram language models.

threads:m sets m threads (default: conccurrency detected by boost)
batch_size:m sets the batch size for threading.  Expect memory usage from this
    of 2*threads*batch_size n-grams.

There are two inputs: vocabulary and model.  Either may be given as a file
    while the other is on stdin.  Specify the type given as a file using
    vocab: or model: before the file name.  

For ARPA format, the output must be seekable.  For raw format, it can be a
    stream i.e. /dev/stdout

But when I tried the following, it got stuck and did nothing:

$ ~/moses/bin/filter union lm.en.binary lm.filter.binary
Assuming that lm.en.binary is a model file
Reading lm.en.binary
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100

What should one do to the language model after binarization? Are there any other steps for manipulating large language models to reduce the computing load when tuning?

What is the usual way to tune on a large LM file?

How to use KenLM's filter?

(more details on https://www.mail-archive.com/[email protected]/msg12089.html)

asked by alvas

1 Answer

Answering how to use the filter command of KenLM:

cat small_vocabulary_one_word_per_line.txt \
  | filter single \
      "model:LM_large_vocab.arpa" \
      output_LM_small_vocab

Note that single can be replaced with union or copy. Read more in the help text, which is printed if you run the filter binary without arguments.
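
Putting this together for the tuning use case in the question, a rough sketch would be the following (file names are illustrative; note that filter reads the ARPA file, not the binarized one, and the command in the question most likely hangs because filter is waiting for a vocabulary on stdin):

# collect the tuning-set vocabulary, one word per line
tr ' ' '\n' < dev.src | sort -u > dev.vocab
# filter the big ARPA model down to that vocabulary
filter single "model:LM_large_vocab.arpa" LM_small_vocab.arpa < dev.vocab
# re-binarize the smaller model for tuning
build_binary LM_small_vocab.arpa LM_small_vocab.binary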

answered by Oplatek