How does Elasticsearch 7 track_total_hits improve query speed?

I recently upgraded from Elasticsearch 6 to 7 and stumbled across the 10000 hits limit.

I read the changelog and the documentation, and I also found a single blog post from a company that tried this new feature and measured their performance gains.

But I'm still not sure how and why this feature works. Or does it only improve performance under special circumstances?

Especially when sorting is involved, I can't get my head around it. Because (at least in my world) when sorting a collection you have to visit every document, and that's exactly what they are trying to avoid according to the Documentation: "Generally the total hit count can’t be computed accurately without visiting all matches, which is costly for queries that match lots of documents."

Hopefully someone can explain how things work under the hood and which important point I am missing.

Benjamin M asked Feb 28 '21 22:02


1 Answer

There are at least two different contexts in which not all documents need to be sorted:

A. When index sorting is configured, the documents are already stored in sorted order within each index segment. So whenever a query specifies the same sort as the one used to pre-sort the index, only the top N documents of each segment need to be visited and returned. In this case, if you are only interested in the top N results and you don't care about the total number of hits, you can simply set track_total_hits to false. That's a big optimization, since there's no need to visit all the documents in the index.
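As a rough sketch of case A, the two request bodies below show an index created with index sorting and a search that uses the same sort with hit counting disabled. The field name `timestamp` is a hypothetical placeholder, and the bodies are only built and printed here, not sent to a cluster.

```python
import json

# Hypothetical field name, used only for illustration.
SORT_FIELD = "timestamp"

# Body for creating the index: segments are stored sorted by SORT_FIELD, descending.
create_index_body = {
    "settings": {
        "index": {
            "sort.field": SORT_FIELD,
            "sort.order": "desc",
        }
    },
    "mappings": {
        "properties": {SORT_FIELD: {"type": "date"}}
    },
}

# Body for the search: same sort as the index sort, and no total hit count requested.
search_body = {
    "size": 10,
    "sort": [{SORT_FIELD: "desc"}],
    "track_total_hits": False,
}

print(json.dumps(create_index_body, indent=2))
print(json.dumps(search_body, indent=2))
```

The important part is that the field and order in the search's `sort` match the index sort settings; if they differ, the pre-sorted segments do not help.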

B. When querying in the filter context (i.e. bool/filter), no scores are calculated. The index is simply checked for documents that match a yes/no question, and that process is usually very fast. Since there is no scoring to rank by, only N matching documents need to be returned per shard.
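For case B, a search body along these lines keeps the whole query in the filter context. The `status` field and its value are made up for the example; only the bool/filter structure and track_total_hits come from the discussion above.

```python
import json

# Hypothetical field and value, used only for illustration.
filter_search_body = {
    "size": 10,
    "track_total_hits": False,  # we do not need to know how many docs match in total
    "query": {
        "bool": {
            # Clauses under "filter" are yes/no checks and do not contribute to scoring.
            "filter": [
                {"term": {"status": "published"}}
            ]
        }
    },
}

print(json.dumps(filter_search_body, indent=2))
```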

If track_total_hits is set to false (because you don't care about the exact number of matching docs), then there's no need to count the docs at all, hence no need to visit all documents.

If track_total_hits is set to N (because you only care to know whether there are at least N matching documents), then the counting will stop after N documents per shard.
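To make the three settings concrete, here is a toy model of hit counting on a single shard. It is not how Elasticsearch or Lucene is implemented; `count_matches` is a hypothetical helper that only mimics the shape of the `hits.total` object (`value` plus `relation`) returned by Elasticsearch 7, where the default threshold is 10,000.

```python
def count_matches(matching_docs, track_total_hits):
    """Toy single-shard hit counter. Returns (total, visited).

    track_total_hits may be True (count everything), False (do not count),
    or an integer threshold N (stop counting once N matches were seen).
    """
    if track_total_hits is False:
        # No count requested, so nothing is visited for counting purposes.
        return None, 0
    limit = None if track_total_hits is True else track_total_hits
    visited = 0
    for _ in matching_docs:
        visited += 1
        if limit is not None and visited >= limit:
            # Threshold reached: the value is only a lower bound.
            return {"value": visited, "relation": "gte"}, visited
    # All matches were counted, so the value is exact.
    return {"value": visited, "relation": "eq"}, visited


# One million matching docs, threshold of 10,000: counting stops early.
print(count_matches(range(1_000_000), 10_000))
# → ({'value': 10000, 'relation': 'gte'}, 10000)

# Fewer matches than the threshold: the count is exact.
print(count_matches(range(50), 10_000))
# → ({'value': 50, 'relation': 'eq'}, 50)
```

With `True`, the same call visits all one million matches, which is the cost the setting lets you avoid.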

Relevant links:

  • https://github.com/elastic/elasticsearch/pull/24864
  • https://github.com/elastic/elasticsearch/issues/33028
  • https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand
Val answered Sep 24 '22 09:09