I am studying the Okapi BM25 model. I understand everything except two points, both about calculating the document length (dl) and the average document length (avdl).
The definition I found computes the document length as a summation over my keywords/terms in a particular document. But Wikipedia's definition says that |D| is the length of the document D in words (i.e. the total word count). So the first question is: what is dl actually?
The second question: how do I calculate avdl? Is it just (dl(doc1) + dl(doc2) + ... + dl(docN)) / N, where N is the total number of documents in the collection? And is avdl therefore fixed for the whole collection?
According to Joaquín Pérez-Iglesias in Integrating the Probabilistic Model BM25/BM25F into Lucene, the score function R should be defined as follows:

$$R(q,d) = \sum_{t \in q} idf(t) \cdot \frac{occurs_t^d}{k_1\left((1-b) + b\,\frac{l_d}{avl_d}\right) + occurs_t^d}$$

where $occurs_t^d$ is the term frequency of $t$ in $d$, $l_d$ is the length of document $d$, $avl_d$ is the average document length across the collection, $k_1$ is a free parameter (usually 2), and $b \in [0,1]$ (usually 0.75). Setting $b$ to 0 disables length normalisation, so the document length does not affect the final score. Setting $b$ to 1 applies full length normalisation.
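The inverse document frequency of a term is

$$idf(t) = \log\frac{N - df(t) + 0.5}{df(t) + 0.5}$$

where $N$ is the number of documents in the collection and $df(t)$ is the number of documents in which the term $t$ appears.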
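To make the two quantities from the question concrete, here is a minimal Python sketch of the definitions above. It takes the document length to be the total number of tokens in the document (not only the query keywords) and the average length to be the mean of those lengths over the whole collection, computed once. The whitespace tokenizer, the function names `tokenize`, `length_stats` and `bm25_scores`, and the sample documents are assumptions made for illustration; they are not taken from the paper or from Lucene.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace.
    return text.lower().split()


def length_stats(docs):
    # l_d: total number of tokens in each document.
    doc_lengths = [len(tokenize(d)) for d in docs]
    # avl_d: mean length over the collection, one value for all documents.
    avl_d = sum(doc_lengths) / len(doc_lengths)
    return doc_lengths, avl_d


def bm25_scores(query, docs, k1=2.0, b=0.75):
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_lengths, avl_d = length_stats(docs)

    # df(t): number of documents that contain term t at least once.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks, l_d in zip(tokenized, doc_lengths):
        tf = Counter(toks)  # occurs_t^d for every term in this document
        norm = k1 * ((1 - b) + b * l_d / avl_d)
        score = 0.0
        for t in tokenize(query):
            if tf[t] == 0:
                continue  # term absent from this document contributes nothing
            # Note: this idf form is negative for terms that occur in
            # more than half of the documents.
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] / (norm + tf[t])
        scores.append(score)
    return scores


docs = [
    "the cat sat on the mat",
    "the dog",
    "a cat and a dog played in the big garden today",
    "birds sing",
    "fish swim in water",
]
print(length_stats(docs))                 # lengths [6, 2, 11, 2, 4], average 5.0
print(bm25_scores("cat", docs))           # shorter matching document ranks higher
print(bm25_scores("cat", docs, b=0.0))    # length no longer affects the score
```

With `b=0.75` the first document outscores the third for the query "cat", because both contain the term once but the third is longer than average. With `b=0.0` the two scores coincide, which matches the remark above that document length then has no effect.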