
Disable IDF calculation

In my particular use case, the IDF factor that gets calculated as part of the TF-IDF algorithm messes up the scoring for my queries. Basically, I want the queries to take only the term frequency into account. Is it possible to disable the IDF factor, i.e., set it to 1, for a particular index? I have looked into the similarity module (in version 0.90.X) but haven't found anything that could help; the same goes for the function_score query. Do I need to write a custom Similarity class in Java? Or is there a plugin for what I'm trying to achieve?

GlurG asked Jan 19 '14 20:01





1 Answer

What about the constant_score query?

See http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/ignoring-tfidf.html

Don't hesitate to add ?explain=true to your search request to see how the score is computed.

As you can see here, without the constant_score query:

[Screenshot: With IDF]

And with the constant_score query (which wraps your real query):

[Screenshot: Without IDF]
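For reference, here is a rough sketch of what the wrapped request body could look like, built as a plain Python dict so it can be printed and pasted into a search request. The field name `title`, the search text, and the helper `constant_score_body` are made up for illustration. The `wrapper_key` default of `"query"` follows the style of the guide linked above; as far as I know, newer Elasticsearch releases only accept `"filter"` there, so treat that as an assumption to check against your version. Keep in mind that constant_score gives every matching document the same score, so it drops term frequency along with IDF.

```python
import json


def constant_score_body(inner_query, boost=1.0, wrapper_key="query"):
    """Wrap inner_query in constant_score so every match gets the same score.

    wrapper_key is an assumption: "query" matches the older guide linked above,
    while newer releases are expected to want "filter" instead.
    """
    if wrapper_key not in ("query", "filter"):
        raise ValueError("wrapper_key must be 'query' or 'filter'")
    return {
        "query": {
            "constant_score": {
                wrapper_key: inner_query,  # the "real" query being wrapped
                "boost": boost,            # the constant score given to each hit
            }
        }
    }


# Hypothetical inner query: the field name and text are placeholders.
inner = {"match": {"title": "quick fox"}}
body = constant_score_body(inner)
print(json.dumps(body, indent=2))
```

Send the printed body to your index's _search endpoint with ?explain=true appended, and compare the explanation with the one for the unwrapped query.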

  • Screenshots made with https://beemapp.me
Thomas Decaux answered Oct 17 '22 17:10
