Elasticsearch pagination

What is the best way to do pagination using Elasticsearch? Currently, I am working on an API that uses Elasticsearch in the backend with Python. My index does not have much data, so for now we are doing the pagination in the frontend using JavaScript (and so far we have not had any problems).

For bigger indices, I want to know what is the best way to handle pagination:

  • Scroll API
  • Sliced Scroll
  • search_after

asked Dec 18 '22 by Andrex

1 Answer

The default way of paginating over search results in Elasticsearch is using from/size parameters. This will, however, work only for the top 10k search results.

If you need to go beyond that, the way to go is search_after.

If you need to dump the entire index and it contains more than 10k documents, use the scroll API.

What's the difference?

All of these approaches let you retrieve portions of the search results, but they have major differences.

from/size is the cheapest and fastest; it is what Google would use to serve the second, third, etc. search result pages if it used Elasticsearch.
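
As a minimal sketch (assuming the official elasticsearch-py 8.x client, a local cluster, and a hypothetical index my-index with a title field), a page is just an offset:

    # from/size pagination sketch (assumes elasticsearch-py 8.x and a
    # hypothetical index "my-index" with a "title" field).
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    page = 2          # zero-based page number
    page_size = 10

    resp = es.search(
        index="my-index",
        query={"match": {"title": "pagination"}},
        from_=page * page_size,   # offset of the first hit to return
        size=page_size,           # number of hits per page
    )
    for hit in resp["hits"]["hits"]:
        print(hit["_id"], hit["_source"])

Note that from_ + size cannot exceed index.max_result_window (10000 by default), which is the limit discussed below.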

Scroll API is expensive, because it creates a kind of snapshot of the index the moment you create the first query, to make sure by the end of the scroll you will have exactly the data that was present in the index at the start. Doing a scroll request will cost resources, and running many of them in parallel can kill your performance, so proceed with caution.
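
A rough scroll sketch under the same assumptions (elasticsearch-py 8.x, hypothetical index my-index):

    # Scroll sketch: dump every document of a hypothetical "my-index".
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # The first request opens the scroll context; "2m" keeps the snapshot
    # alive for two minutes after each request.
    resp = es.search(index="my-index", query={"match_all": {}},
                     size=1000, scroll="2m")
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]

    all_docs = []
    while hits:
        all_docs.extend(hit["_source"] for hit in hits)
        resp = es.scroll(scroll_id=scroll_id, scroll="2m")
        scroll_id = resp["_scroll_id"]
        hits = resp["hits"]["hits"]

    # Release the server-side scroll context as soon as you are done.
    es.clear_scroll(scroll_id=scroll_id)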

search_after, instead, is a half-way point between the two:

search_after is not a solution to jump freely to a random page but rather to scroll many queries in parallel. It is very similar to the scroll API but unlike it, the search_after parameter is stateless, it is always resolved against the latest version of the searcher. For this reason the sort order may change during a walk depending on the updates and deletes of your index.

So it will allow you to paginate beyond 10k, at the cost of some possible inconsistency.
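
A hedged search_after sketch, again assuming elasticsearch-py 8.x and a hypothetical my-index with a timestamp field plus a unique id field used as a tiebreaker (the sort must be deterministic for search_after to work):

    # search_after sketch: walk the whole result set page by page.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    search_after = None
    while True:
        kwargs = {
            "index": "my-index",
            "query": {"match_all": {}},
            "size": 1000,
            # "timestamp" and "id" are hypothetical fields; "id" acts as a
            # unique tiebreaker so the sort order is deterministic.
            "sort": [{"timestamp": "asc"}, {"id": "asc"}],
        }
        if search_after is not None:
            kwargs["search_after"] = search_after
        resp = es.search(**kwargs)

        hits = resp["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            print(hit["_id"])          # handle the document here

        # The sort values of the last hit seed the next request.
        search_after = hits[-1]["sort"]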

Why the 10k limit?

index.max_result_window is set to 10k as a hard limit to avoid out of memory situations:

index.max_result_window

The maximum value of from + size for searches to this index. Defaults to 10000. Search requests take heap memory and time proportional to from + size and this limits that memory.
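
If you really need deeper from/size pagination on a specific index, the setting can be raised per index. The sketch below assumes elasticsearch-py 8.x (where the request body goes into the settings argument) and the hypothetical my-index; raise the limit only if you accept the extra heap usage that comes with it:

    # Raise max_result_window for one index (use with care: it trades memory
    # for deeper from/size pagination).
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    es.indices.put_settings(
        index="my-index",
        settings={"index.max_result_window": 20000},
    )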

What about sliced scroll?

Sliced scroll is just a faster way of doing a normal scroll: it allows you to download the collection of documents in parallel. A slice is just a subset of the documents in the scroll query output.
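
A sliced-scroll sketch under the same assumptions (and assuming the 8.x client accepts slice as a keyword argument): each slice is an independent scroll over a disjoint subset of my-index, so separate workers can consume different slices in parallel.

    # Sliced scroll sketch: slice 0 of 2; a second worker would run id=1.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    def scroll_slice(slice_id: int, max_slices: int):
        resp = es.search(
            index="my-index",
            query={"match_all": {}},
            size=1000,
            scroll="2m",
            slice={"id": slice_id, "max": max_slices},
        )
        scroll_id = resp["_scroll_id"]
        hits = resp["hits"]["hits"]
        while hits:
            yield from hits
            resp = es.scroll(scroll_id=scroll_id, scroll="2m")
            scroll_id = resp["_scroll_id"]
            hits = resp["hits"]["hits"]
        es.clear_scroll(scroll_id=scroll_id)

    for hit in scroll_slice(0, 2):
        print(hit["_id"])              # handle the document here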

answered Dec 28 '22 by Nikolay Vasiliev