In my particular use case, the IDF-factor that gets calculated as part of the TF-IDF algorithm messes up the scoring for my queries. Basically, I want the queries to only take the term frequency into account. Is it possible to disable the IDF factor, i.e set it to 1, for a particular index? I have looked into the similarity module (in version 0.90.X), but haven't really found anything that could help; same goes for the function_score query. Do I need to write a custom Similarity class in java? Or is there a plugin for what I'm trying to achieve?
The formula for IDF starts with the total number of documents in our database: N. Then we divide this by the number of documents containing our term: tD.
IDF is the inverse of the document frequency which measures the informativeness of term t. When we calculate IDF, it will be very low for the most occurring words such as stop words (because stop words such as “is” is present in almost all of the documents, and N/df will give a very low value to that word).
The effect of adding “1” to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored. (Note that the idf formula above differs from the standard textbook notation that defines the idf as idf(t) = log [ n / (df(t) + 1) ]).
How is TF-IDF calculated? TF-IDF for a word in a document is calculated by multiplying two different metrics: The term frequency of a word in a document. There are several ways of calculating this frequency, with the simplest being a raw count of instances a word appears in a document.
What about constant_score query?
See http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/ignoring-tfidf.html
Don't hesitate to use ?explain=true to see how scoring is working.
As you can here without constant_filter:
And with constant_filter query (that wraps your real query):
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With