
High level explanation of Similarity Class for Lucene?

Do you know where I can find a high-level explanation of the algorithm behind Lucene's Similarity class? I would like to understand it without having to decipher all the math and terminology involved in searching and indexing.

Geo asked Mar 17 '09 18:03



1 Answer

Lucene's built-in Similarity is a fairly standard TF-IDF ("term frequency, inverse document frequency") scoring algorithm. The Wikipedia article on TF-IDF is brief, but covers the basics. The book Lucene in Action breaks down the Lucene formula in more detail; it doesn't mirror the current Lucene formula perfectly, but all of the main concepts are explained.

Primarily, the score varies directly with the number of times a term occurs in the current document (the term frequency), and inversely with the number of documents in the index that contain the term (the document frequency). The other factors in the formula are secondary, adjusting the score in an attempt to make scores from different queries fairly comparable to each other.
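
To make the two main factors concrete, here is a minimal Python sketch rather than Lucene's actual code. It assumes the square-root term frequency and the `1 + ln(numDocs / (docFreq + 1))` inverse document frequency used by Lucene's classic TF-IDF similarity, and it deliberately leaves out the secondary factors (coordination, query normalization, boosts, length norms). The function names and the tiny corpus are made up for illustration.

```python
import math
from collections import Counter


def tf(freq):
    """Term frequency factor: grows with occurrences, but sub-linearly."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms get a larger weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def score(query_terms, doc_tokens, corpus):
    """Simplified score: sum over query terms of tf * idf^2.

    Hypothetical helper for illustration only; the secondary
    factors of the real formula are omitted.
    """
    counts = Counter(doc_tokens)
    num_docs = len(corpus)
    total = 0.0
    for term in query_terms:
        # Document frequency: how many documents contain the term at all.
        doc_freq = sum(1 for doc in corpus if term in doc)
        total += tf(counts[term]) * idf(doc_freq, num_docs) ** 2
    return total


# Toy corpus (assumed data): each document is a list of tokens.
corpus = [
    "lucene scores documents with lucene similarity".split(),
    "the similarity class is pluggable".split(),
    "the index stores the documents".split(),
]

for i, doc in enumerate(corpus):
    print(i, round(score(["lucene", "similarity"], doc, corpus), 3))
```

In this toy corpus the first document ranks highest for the query: it contains both terms, and "lucene" is both repeated in it (higher term frequency) and rare across the corpus (lower document frequency).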

erickson answered Sep 21 '22 23:09
