
ElasticSearch documentation says not to use scroll for user requests, only for data transformation

I'm new to ES and confused by its documentation of scroll. From the docs: "Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration".

And yet...further down on the same page it says not to use from() and size() to do pagination because it "is very inefficient". And on the Java API page describing Search it shows an example of paging via Scroll.

So, assuming I want to present sorted search results, a page at a time, which approach is recommended: from/size or Scrolling?

Brian Tarbox asked Aug 22 '14 20:08


2 Answers

from/size is very inefficient when you want to do deep pagination or request a lot of results per page.

The reason is that results are sorted first on each shard, and all those results are then gathered, merged and sorted by the coordinating node that handles the request. This becomes more and more costly as the pages grow either in size or in rank. You will find a very good example documented here.
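To make the cost concrete, here is a toy simulation in plain Python (no Elasticsearch involved). The shard layout and the `coordinator_page` helper are made up for illustration; it only mimics the gather-and-merge step described above, assuming each shard has to offer its top `from + size` hits.

```python
import heapq
from itertools import islice

def coordinator_page(shards, from_, size):
    """Return one page of globally sorted hits plus the number of hits gathered.

    `shards` is a list of lists, each already sorted ascending (a stand-in for
    per-shard sorted results). Every shard offers its top from_ + size hits,
    because the coordinator cannot know in advance which shard holds the page.
    """
    window = from_ + size
    per_shard = [shard[:window] for shard in shards]
    gathered = sum(len(hits) for hits in per_shard)
    merged = heapq.merge(*per_shard)          # merge the sorted shard results
    page = list(islice(merged, from_, from_ + size))
    return page, gathered

# 5 hypothetical shards holding 2000 documents each
shards = [list(range(start, 10_000, 5)) for start in range(5)]

first_page, cost_first = coordinator_page(shards, from_=0, size=10)
deep_page, cost_deep = coordinator_page(shards, from_=1000, size=10)
print(cost_first, cost_deep)
```

Both calls return 10 hits, but the second one has to gather and merge roughly a hundred times more entries to get there.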

You could limit the size of your users' queries (e.g. to something like ~1000 results), and you will be fine using from/size.
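A minimal sketch of such a cap, assuming you build the search body yourself before sending it. `page_request` and `MAX_RESULTS` are hypothetical names, and the 1000 limit is just the figure suggested above. (Elasticsearch also enforces its own ceiling through the `index.max_result_window` setting, which defaults to 10000.)

```python
MAX_RESULTS = 1000  # assumed application-level cap, from the "~1000" suggestion

def page_request(query, page, page_size=20, max_results=MAX_RESULTS):
    """Build a from/size search body, refusing pages beyond the cap."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    from_ = (page - 1) * page_size
    if from_ + page_size > max_results:
        raise ValueError(f"results beyond {max_results} are not served")
    return {"query": query, "from": from_, "size": page_size}

body = page_request({"match_all": {}}, page=3)
```

The returned dict is what you would send as the search request body; a user who asks for a page past the cap gets an error (or a "refine your search" message) instead of an expensive deep query.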

If that is not an option, you can still use scroll, but you will lose some features, such as aggregations, and keeping the search context alive has a cost.

ThomasC answered Oct 08 '22 04:10


You can use search_after. The basic process flow will be like this:

  1. Perform your regular search to return an array of sorted document results by date.
  2. Perform the next query with the search_after field in the body to tell Elasticsearch to only return documents after the specified document (date).
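The two steps above could look roughly like this as request bodies. This is only a sketch: the `date` field comes from the example above, while `tie_breaker_id` is a hypothetical unique field added so that documents with the same date still have a stable order. Each hit in a sorted response carries a `sort` array, and that is the value passed back as `search_after`.

```python
def first_page_body(query, size=20):
    """Step 1: a regular sorted search."""
    return {
        "size": size,
        "query": query,
        # "tie_breaker_id" is an assumed unique field, not part of the question
        "sort": [{"date": "asc"}, {"tie_breaker_id": "asc"}],
    }

def next_page_body(query, previous_hits, size=20):
    """Step 2: continue after the last hit of the previous page."""
    if not previous_hits:
        return None  # nothing left to page through
    body = first_page_body(query, size)
    body["search_after"] = previous_hits[-1]["sort"]
    return body

# Example with hand-written hits standing in for a real response's hits.hits
hits = [
    {"_id": "a", "sort": [1408737600000, "a"]},
    {"_id": "b", "sort": [1408741200000, "b"]},
]
body = next_page_body({"match_all": {}}, hits)
```

The sort clause has to be identical in every request, otherwise the `search_after` values no longer line up with the sort order.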

This way, your results stay accurate even if documents are updated or deleted between requests. You also avoid the cost of keeping a scroll context alive (as you've likely already read) and the cost of from/size, which grows linearly with the offset because every query starts again from the first result.

See the docs for more info and implementation details.

writofmandamus answered Oct 08 '22 04:10