
What does disable_coord parameter for boolean queries mean?

According to the documentation, the default value of disable_coord in Elasticsearch is false. I cannot find a detailed explanation of how setting this parameter to true would affect search results.

Karan Verma asked Sep 04 '14 23:09



4 Answers

If a bool query has N subqueries with the same boosts/weights, then disable_coord=true scores matches as follows.

Assume that:

  • all subqueries have the same boost and weight.
  • N -- total number of subqueries.
  • n -- number of subqueries that matched.

When n subqueries match, the total score is proportional to the sum of the boosts/weights of the matched queries. Since we are assuming equal weights/boosts, this is: Sn = n*const.

When all subqueries match (n=N): Smax = N*const

A partial match, relative to a full match, scores part_of_max = Sn / Smax = (n*const) / (N*const) = n/N

For example if you have 3 subqueries:

  • all subqueries match: total score will be Smax
  • 2 subqueries match: total score will be part_2 = 2/3=0.66 (66%) of Smax
  • 1 subquery matches: total score will be part_1 = 1/3=0.33 (33%) of Smax

Let's compare this to scoring when coordination is enabled (the default behaviour of Elasticsearch). Long story short: "partial" matches score much worse than full matches.

The approximate score is proportional to the sum of the weights/boosts of the matched subqueries multiplied by n/N. If the boosts/weights are equal, the total score is proportional to Sn₂ = n*n/N * const = n²/N * const

When all subqueries are matched (n=N): Smax₂ = N*(N/N)*const = N * const

Partial matches compared to full match will be part_of_max₂ = Sn₂ / Smax₂ = (n²/N * const) / (N * const) = n²/N²

For example if you have 3 subqueries:

  • all subqueries match: total score will be Smax₂, the same as Smax when coordination is disabled
  • 2 subqueries match: total score will be part_2₂ = 4/9=0.44 (44%) of Smax₂, which is 2/3 (66%) of part_2
  • 1 subquery matches: total score will be part_1₂ = 1/9=0.11 (11%) of Smax₂, which is 1/3 (33%) of part_1

Comparing the two coordination approaches: scores with disable_coord=False are smaller than scores with disable_coord=True by a factor of (n²/N²)/(n/N) = n/N
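
The ratios above can be checked with a short Python sketch. It assumes the simplified model used in this answer (equal weights, score proportional to the number of matched subqueries) and is not the real Lucene formula; relative_score is a hypothetical helper name, not part of Elasticsearch.

```python
from fractions import Fraction


def relative_score(n, N, disable_coord):
    """Score of a document matching n of N equally weighted subqueries,
    expressed as a fraction of the score of a full match (n == N).

    Simplified model from the text above, not the real Lucene formula.
    """
    if N == 0 or not 0 <= n <= N:
        raise ValueError("need N > 0 and 0 <= n <= N")
    base = Fraction(n, N)          # sum of matched weights: n/N of the maximum
    if disable_coord:
        return base                # no coordination factor applied
    return base * Fraction(n, N)   # coordination multiplies by n/N once more


if __name__ == "__main__":
    N = 3
    for n in range(N, 0, -1):
        off = relative_score(n, N, disable_coord=True)
        on = relative_score(n, N, disable_coord=False)
        print(f"n={n}: disable_coord=True -> {off}, disable_coord=False -> {on}")
```

Exact fractions are used so that 2/3 and 4/9 come out unrounded.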

Possible use cases for different query types with different coordination policies:

  • "fuller" matches should be much more relevant than partial matches: use the default bool query with coordination enabled
  • each of your subqueries is self-sufficient, and matching more subqueries is good and "linearly" important: use a bool query with disable_coord=True
  • each of your subqueries is equally important, and a match of 1 subquery should be treated the same as a match of all subqueries: use a dis_max query
  • you are searching in multiple fields, and non-overlapping matches in multiple fields are better than the same number of matches in a single field: use a combination of bool and dis_max queries (more details in the docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-dis-max-query.html)
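
As a rough illustration, the request bodies for two of these cases could look like the Python dicts below. The "title" field and the search words are made up for the example, and disable_coord is only accepted by older Elasticsearch releases (as far as I know it was removed in 6.0), so treat this as a sketch rather than something to paste into a current cluster.

```python
import json

# Two hypothetical subqueries on an assumed "title" field.
subqueries = [
    {"match": {"title": "quick"}},
    {"match": {"title": "fox"}},
]

# bool query with the coordination factor switched off
# (only valid on Elasticsearch versions that still accept disable_coord).
bool_no_coord = {
    "query": {
        "bool": {
            "should": subqueries,
            "disable_coord": True,
        }
    }
}

# dis_max query: the document score comes from its best-matching subquery.
dis_max_query = {"query": {"dis_max": {"queries": subqueries}}}

if __name__ == "__main__":
    print(json.dumps(bool_no_coord, indent=2))
    print(json.dumps(dis_max_query, indent=2))
```

Serialising with json.dumps turns True into the JSON literal true that the query DSL expects.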

Note that the same subquery may score differently when a term appears several times in the document: this is controlled by term frequency (https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html#tf) -- and it is not affected by disable_coord, contrary to what is said in another answer (https://stackoverflow.com/a/26998760/952437). Field-length normalization also affects how scores are calculated.

If you'd like to know how these 3 concepts (coordination, term frequency and field-length normalization) work together, see the following example:

Query: quick brown fox -- this is actually 3 queries combined with "OR"

disable_coord=True:

  • quick brown fox rocks -- Score of ~=3*1/(sqrt(4))*const = 3*tmp_const, where tmp_const = const/sqrt(4)
  • quick brown fox quick -- Score of ~=(1+1*sqrt(2)+1)*1/(sqrt(4))*const = 3.41 * tmp_const
  • quick brown fox quick fox -- Score of ~=(1+1*sqrt(2)+1*sqrt(2))*1/(sqrt(5))*const = 3.83 * 0.894 * tmp_const = 3.42 * tmp_const. One extra fox makes the result more relevant, but this is almost entirely offset by field-length normalization
  • quick brown bird flies -- Score of ~=2*1/(sqrt(4))*const = 2*tmp_const
  • quick brown bird -- Score of ~=2*1/(sqrt(3))*const = 2*1.1547*tmp_const ~= 2.31*tmp_const
  • fox -- Score of ~=1*1/(sqrt(1))*const = 1*2*tmp_const = 2*tmp_const -- a single matching term scores as high as quick brown bird flies, which matches two terms. This is caused by field-length normalization

disable_coord=False:

  • quick brown fox rocks (coord_factor=3/3=1) -- Score of ~=3*1/(sqrt(4))*const = 3*tmp_const, where tmp_const = const/sqrt(4)
  • quick brown fox quick (coord_factor=3/3=1) -- Score of ~=(1+1*sqrt(2)+1)*1/(sqrt(4))*const = 3.41 * tmp_const
  • quick brown fox quick fox (coord_factor=3/3=1) -- Score of ~=(1+1*sqrt(2)+1*sqrt(2))*1/(sqrt(5))*const = 3.83 * 0.894 * tmp_const = 3.42 * tmp_const
  • quick brown bird flies (coord_factor=2/3=0.66) -- Score of ~=2*1/(sqrt(4))*const * 2/3 = 1.33*tmp_const. Lower score because of coordination
  • quick brown bird (coord_factor=2/3=0.66) -- Score of ~=2*1/(sqrt(3))*const * 2/3 = 2*1.1547*tmp_const * 2/3 ~= 1.54*tmp_const. Lower score because of coordination
  • fox (coord_factor=1/3=0.33) -- Score of ~=1*1/(sqrt(1))*const * 1/3 = 1*2*tmp_const * 1/3 ~= 0.67*tmp_const. Because of coordination this result is now less relevant than every result that matches more terms

The real score will also depend on IDF (inverse document frequency). The examples above assume that all the terms have the same frequency in the index.
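
The example scores can be reproduced with a small Python sketch of this simplified model: tf = sqrt(term frequency), norm = 1/sqrt(number of terms in the field), coord = matched query terms / total query terms, and IDF treated as constant. This is an assumption-laden approximation of classic TF/IDF scoring, not what Elasticsearch actually executes; simplified_score is a hypothetical helper name.

```python
import math
from collections import Counter


def simplified_score(query, document, disable_coord):
    """Score in units of const under the simplified model described above.

    Whitespace tokenisation only; IDF and query normalisation are ignored.
    """
    query_terms = query.split()
    doc_terms = document.split()
    if not query_terms or not doc_terms:
        return 0.0
    counts = Counter(doc_terms)
    matched = [term for term in query_terms if counts[term] > 0]
    tf_sum = sum(math.sqrt(counts[term]) for term in matched)  # term frequency
    norm = 1 / math.sqrt(len(doc_terms))                       # field-length norm
    score = tf_sum * norm
    if not disable_coord:
        score *= len(matched) / len(query_terms)               # coordination factor
    return score


if __name__ == "__main__":
    query = "quick brown fox"
    documents = [
        "quick brown fox rocks",
        "quick brown fox quick",
        "quick brown fox quick fox",
        "quick brown bird flies",
        "quick brown bird",
        "fox",
    ]
    tmp_const = 1 / math.sqrt(4)  # unit used in the lists above
    for doc in documents:
        off = simplified_score(query, doc, disable_coord=True) / tmp_const
        on = simplified_score(query, doc, disable_coord=False) / tmp_const
        print(f"{doc!r}: coord disabled {off:.2f}, coord enabled {on:.2f}")
```

Dividing by 1/sqrt(4) expresses each result in the same tmp_const units as the lists above.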

imposeren answered Nov 15 '22 22:11


It is used in Lucene scoring, while the results are being scored.

For example, I might want to modify the coord score of a bool query so that the score of the entire query is multiplied by 2 if a particular clause, text or value matches.

Akash Yadav answered Nov 15 '22 22:11


This is the coordination factor.

  • If the coord factor is enabled (the default, "disable_coord": false), a result that contains more of the search keywords is considered more relevant and gets a higher score.

  • If the coord factor is disabled ("disable_coord": true), then no matter how many of the keywords appear in the searched text, they are counted just once.

You can find more details here.

Igor answered Nov 15 '22 20:11


Suppose you have a should clause containing three queries. If a document matches the first query, it gets some score for that. Now suppose the document does not exactly match the second query but matches it partially: the document is then given a little extra score, i.e. (first query match score + second query partial match score).

If you do not want this partial score to count in your query, write "disable_coord": true. The document will then be scored only on the queries it matches exactly, not on partial matches. I hope this makes it clear.

Sudhanshu Gaur answered Nov 15 '22 20:11