
Average Document Length in Okapi BM25

I am studying the Okapi BM25 model. I understand everything except two points, both about calculating the document length (dl) and the average document length (avdl). I found the document length defined as

[image not preserved: formula for the document length dl]

So it is a summation over my keywords/terms in a particular document. But Wikipedia's definition is:

[image not preserved: BM25 formula from Wikipedia]

So |D| is the length of the document D in words (i.e. the total word count). So my first question: what is dl actually?

My second question: how do I calculate avdl? Is it just (dl(doc1) + dl(doc2) + ... + dl(docN)) / N, where N is the total number of documents in the collection? And is avdl then fixed for the whole collection?

Nusrat asked Apr 18 '14 20:04


1 Answer

According to Joaquín Pérez-Iglesias in Integrating the Probabilistic Model BM25/BM25F into Lucene, the score function R should be defined as follows:

[image not preserved: formula for the score function R]

where

  • occurs_t^d is the term frequency of t in d,
  • l_d is the length of document d,
  • avl_d is the average document length across the collection,
  • k_1 is a free parameter (usually 2) and b is a free parameter in [0,1] (usually 0.75).
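
Written out with the symbols listed above, the score function would plausibly take the usual BM25 shape below. This is a reconstruction under that assumption, not a copy of the paper's figure; in particular, the placement of the idf(t) factor inside the sum is assumed from the standard formulation.

```latex
R(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot
          \frac{occurs_t^d}{k_1 \left( (1 - b) + b \, \frac{l_d}{avl_d} \right) + occurs_t^d}
```

Some formulations, including the one on Wikipedia, also multiply the numerator by (k_1 + 1). For a fixed k_1 that factor rescales every term weight by the same constant, so it does not change the ranking.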

Setting b to 0 disables length normalisation, so the document length does not affect the final score.

Setting b to 1 applies full length normalisation.
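
To make the role of b concrete, here is a minimal sketch of the length-normalisation factor on its own, assuming the usual BM25 form (1 - b) + b * l_d / avl_d. The function names `length_norm` and `term_weight` are illustrative, not taken from the paper or from Lucene.

```python
def length_norm(dl, avdl, b=0.75):
    """Length-normalisation factor, assuming the usual BM25 form."""
    return (1 - b) + b * dl / avdl


def term_weight(tf, dl, avdl, k1=2.0, b=0.75):
    """Saturating term-frequency weight for one term in one document."""
    return tf / (k1 * length_norm(dl, avdl, b) + tf)


# A document twice as long as the average (dl=200, avdl=100):
no_norm = length_norm(200, 100, b=0.0)    # b = 0: factor is 1 whatever dl is
full_norm = length_norm(200, 100, b=1.0)  # b = 1: factor is exactly dl / avdl
```

With b = 0 the factor is 1 for every document, so `term_weight` depends only on the term frequency. With b = 1 the factor equals l_d / avl_d, so a longer-than-average document gets a smaller weight for the same term frequency.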

[image not preserved: formula using N and df]

where N is the number of documents in the collection and df is the number of documents in which the term t appears.
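
On the two questions about dl and avdl, the following is a rough end-to-end sketch rather than a reference implementation. It assumes whitespace tokenisation, the common idf form log((N - df + 0.5) / (df + 0.5)), and the score function without the (k_1 + 1) factor; `tokenize` and `bm25_scores` are hypothetical names. The point it illustrates is that l_d counts every token in the document, not only the query terms, and that avl_d is the plain mean of those lengths, computed once for the whole collection.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; a real system would use an analyzer.
    return text.lower().split()


def bm25_scores(query, docs, k1=2.0, b=0.75):
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # dl: total number of tokens in each document (all words, not just query terms).
    doc_lengths = [len(tokens) for tokens in tokenized]
    # avdl: mean document length, one value for the whole collection.
    avdl = sum(doc_lengths) / n_docs

    # df: number of documents in which each term appears at least once.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens, dl in zip(tokenized, doc_lengths):
        tf = Counter(tokens)
        norm = (1 - b) + b * dl / avdl
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue
            # Assumed idf form; it goes negative when a term is in more than half the documents.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] / (k1 * norm + tf[term])
        scores.append(score)
    return scores, doc_lengths, avdl


docs = [
    "the cat sat on the mat",
    "the dog barked",
    "a cat and a dog and a bird",
]
scores, doc_lengths, avdl = bm25_scores("bird", docs)
```

Here `doc_lengths` holds the per-document token counts and `avdl` is their mean. `avdl` only changes when documents are added to or removed from the collection, so it can be computed once at indexing time and reused for every query.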

eliasah answered Mar 20 '23 08:03