
Correct way to calculate probabilities using ARPA LM data

I am writing a small library for calculating ngram probabilities.

I have an LM described by an ARPA file (it's a fairly simple format: log10 probability, n-gram, log10 backoff weight):

...
-5.1090264  Hello   -0.05108307
-5.1090264  Bob -0.05108307
-3.748848   we -0.38330063
...
-2.5558481  Hello Bob   -0.012590006
...
-1.953679   Hello Bob how   -0.0022290824
...
-0.58411354 Hello Bob how are   -0.0007929117
...
-1.4516809  Hello Bob how are you
...

But how do I calculate P(we|Hello Bob how are) here correctly?

P(we|Hello Bob how are) = P(we) * BWt(Hello Bob how are) ?

or is this the right way:

P(we|Hello Bob how are) = P(are we) * BWt(Hello Bob how) ?

What if I don't have a backoff weight for the 4-gram (Hello Bob how are)?

Please point me to a general formula for calculating these probabilities, or to somewhere I can read about it. I really can't find anything good.

Bob asked Oct 23 '25 05:10


1 Answer

Suppose the LM looks like this:

...
\1-grams:
p1 word1 bw1
\2-grams:
p2 word1 word2 bw2
p4 word2 word3 bw4
\3-grams:
p3 word1 word2 word3 bw3
...

How to calculate P(word3 | word1, word2)?

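# Written with plain probabilities. ARPA files store log10 values,
# so in log space each product below becomes a sum.
if exist(word1, word2, word3):
    return p3
else if exist(word1, word2):
    # context seen, trigram unseen: apply the context's backoff weight
    return bw2 * P(word3 | word2)
else:
    # context unseen: the backoff weight is 1
    return P(word3 | word2)

# P(word3 | word2) is computed the same way: it is p4 if (word2, word3)
# is listed, otherwise bw(word2) * P(word3).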

When an n-gram isn't listed in the model, we back off to the lower-order n-gram by dropping the first word of the context, and multiply by the backoff weight of the full context.
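
Since the question asks for a general formula, the trigram case above can be written for any order as the recursion below. This is a sketch in my own notation, not something taken from the ARPA file itself: p(...) and bw(...) stand for the probability and backoff weight listed for an n-gram (after converting from log10), and bw(...) is taken as 1 when the context is not listed or has no weight.

```latex
P(w_n \mid w_1 \dots w_{n-1}) =
\begin{cases}
p(w_1 \dots w_n) & \text{if } w_1 \dots w_n \text{ is listed in the model} \\
\mathrm{bw}(w_1 \dots w_{n-1}) \cdot P(w_n \mid w_2 \dots w_{n-1}) & \text{otherwise}
\end{cases}
```

The recursion stops at the unigram p(w_n). In log10 space, which is what the file stores, the product becomes a sum.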

If an n-gram has no backoff weight, treat the weight as 1 (log10(bw) == 0).
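
To make the lookup concrete, here is a minimal Python sketch of the procedure. The function name log10_prob and the dictionary layout (n-gram tuple mapped to a (log10 probability, log10 backoff weight) pair) are my own assumptions, not part of any library. The toy model contains only the lines shown in the question; the entries elided with "..." are simply left out, so the numbers are illustrative only.

```python
def log10_prob(lm, context, word):
    """Return log10 P(word | context) with ARPA-style backoff.

    lm (hypothetical layout) maps n-gram tuples to
    (log10_probability, log10_backoff_weight) pairs.
    """
    ngram = tuple(context) + (word,)
    if ngram in lm:
        # The full n-gram is listed: use its probability directly.
        return lm[ngram][0]
    if not context:
        # Unlisted unigram. A real model would usually map this to <unk>;
        # this sketch just reports it.
        raise KeyError(f"unknown word: {word!r}")
    # Backoff weight of the context; log10(1) == 0 if it is not listed.
    backoff = lm.get(tuple(context), (0.0, 0.0))[1]
    # Drop the first context word and retry with the shorter context.
    return backoff + log10_prob(lm, tuple(context)[1:], word)


# Toy model: only the entries visible in the question.
lm = {
    ("Hello",): (-5.1090264, -0.05108307),
    ("Bob",): (-5.1090264, -0.05108307),
    ("we",): (-3.748848, -0.38330063),
    ("Hello", "Bob"): (-2.5558481, -0.012590006),
    ("Hello", "Bob", "how"): (-1.953679, -0.0022290824),
    ("Hello", "Bob", "how", "are"): (-0.58411354, -0.0007929117),
    ("Hello", "Bob", "how", "are", "you"): (-1.4516809, 0.0),
}

result = log10_prob(lm, ("Hello", "Bob", "how", "are"), "we")
print(result)
```

With this toy model, the 5-gram ending in "we" is not listed, so the lookup adds the backoff weight of (Hello Bob how are) and then keeps shortening the context. None of the shorter contexts are listed here, so they contribute 0, and the result is log10 P(we) plus that one backoff weight. With a full model, any listed intermediate n-gram or context would change the result.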

yanshengjia answered Oct 25 '25 14:10


