I have come across at least two possible ways to fetch results in batches:
1. Scroll API
2. Pagination with the from and size parameters
What is the fundamental difference? I am assuming #1 lets you scroll over the records while #2 lets you fetch one batch of records at a time. If I just use different from/size values to drive pagination, is there a chance that the same record will be returned in different batches?
Using from/size is the default and easiest way to paginate results. By default, it only works as long as from + size does not exceed 10,000 (the index.max_result_window setting). You can increase that limit, but it is not advised to go too far because deep pagination will decrease the performance of your cluster.
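To make the from/size behaviour concrete, here is a minimal Python sketch. The helper names (from_size_body, run_query) and the toy list of scores are made up for illustration and nothing here talks to a real cluster. The first function builds a request body and refuses pages beyond the default window. The second part simulates why the same record can show up in two batches: every from/size request re-runs the query, so anything indexed between two requests can shift the results.

```python
from typing import Any, Dict, List, Optional

MAX_RESULT_WINDOW = 10_000  # default of the index.max_result_window setting


def from_size_body(page: int, size: int,
                   query: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a search body for a zero-based page number (hypothetical helper)."""
    start = page * size
    if start + size > MAX_RESULT_WINDOW:
        raise ValueError("from + size exceeds the result window; "
                         "use scroll or search_after instead")
    return {"from": start, "size": size, "query": query or {"match_all": {}}}


def run_query(index: List[int], start: int, size: int) -> List[int]:
    """Toy stand-in for a search: re-sort the live data on every request."""
    return sorted(index, reverse=True)[start:start + size]


# Simulated index holding four documents, represented by their scores.
index = [50, 40, 30, 20]
first_page = run_query(index, 0, 2)
index.append(60)  # a new document is indexed between the two requests
second_page = run_query(index, 2, 2)

print(first_page, second_page)  # [50, 40] [40, 30]
```

The document with score 40 appears on both pages, so yes, with plain from/size on a changing index you can see the same record twice (or miss one).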
The scroll API will allow you to paginate over all your data. The way it works is by creating a search context (i.e. a snapshot of the data at the time you start scrolling), and then you get a cursor to paginate over all your data. When done, you can close the search context. The search context has an associated cost (it requires state, hence memory), so this way of paginating is not suited to real-time pagination; it is better suited to batch-like processing.
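A rough sketch of that open / iterate / close cycle in Python. It assumes a client object shaped like the official elasticsearch Python client (search, scroll and clear_scroll methods that accept these keyword arguments); the exact keyword names differ a bit between client versions, so treat the calls as an assumption and check them against your version. The scroll_all name and the match_all query are just for illustration.

```python
from typing import Any, Dict, Iterator


def scroll_all(client: Any, index: str, page_size: int = 1000,
               keep_alive: str = "1m") -> Iterator[Dict[str, Any]]:
    """Yield every hit of a match_all query using the scroll API."""
    # The first search opens the search context and returns the first batch.
    resp = client.search(index=index, scroll=keep_alive, size=page_size,
                         body={"query": {"match_all": {}}})
    scroll_id = resp["_scroll_id"]
    try:
        while resp["hits"]["hits"]:
            yield from resp["hits"]["hits"]
            # Each scroll call returns the next batch; the id may change.
            resp = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = resp["_scroll_id"]
    finally:
        # Release the search context so it stops holding memory.
        client.clear_scroll(scroll_id=scroll_id)
```

Usage would be something like `for hit in scroll_all(es, "my-index"): ...`, where es and my-index are placeholders. The finally block is there so the context gets closed even if the consumer stops early or an error occurs.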
There is another way of scrolling over all the data without the additional cost of creating a dedicated search context every time, and it's called search_after. In this flavor, the idea is to sort your data and then use the sort values as lightweight cursors. It can have some drawbacks: for instance, if you're constantly indexing new data, you run the risk of missing new data that would have appeared on a previous "page".
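Here is a small Python sketch of the search_after loop. As with any client code here, it assumes a client with a search(index=..., body=...) method like the official Python client; search_after_all is a made-up name, and the sort fields you pass in are up to you (you want a unique tiebreaker field as the last sort key so that no two documents share the same cursor).

```python
from typing import Any, Dict, Iterator, List


def search_after_all(client: Any, index: str, sort: List[Dict[str, str]],
                     page_size: int = 1000) -> Iterator[Dict[str, Any]]:
    """Yield all hits, using the last hit's sort values as the next cursor."""
    cursor = None
    while True:
        body: Dict[str, Any] = {
            "size": page_size,
            "query": {"match_all": {}},
            "sort": sort,
        }
        if cursor is not None:
            # Resume right after the last document of the previous page.
            body["search_after"] = cursor
        hits = client.search(index=index, body=body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Every hit carries the sort values it was ordered by.
        cursor = hits[-1]["sort"]
```

A call could look like `search_after_all(es, "my-index", [{"timestamp": "asc"}, {"id": "asc"}])`, where the index and field names are hypothetical. No state is kept on the server between requests, which is why it is cheap, and also why documents indexed "behind" the cursor are never seen.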
In 7.10, there is going to be yet another way of paginating data, which is called Point in Time search (PIT). Here the idea is again to create a context, this time a lightweight view of the index as it was when the PIT was opened, so that several calls see the same data. For instance, you can return hits as rapidly as possible and aggregations (a bit later) in two distinct calls.
UPDATE
7.10 got released on Nov 11th, 2020, and Point in Time searches are now available, too.
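For completeness, a hedged sketch of what paging with a PIT looks like at the request level. The comments list the REST calls involved, and the helper only builds the search body, so nothing is sent anywhere. pit_search_body, the index name and the sort fields are hypothetical; the PIT id would come from the response of the open request.

```python
from typing import Any, Dict, List, Optional

# Request sequence (index name and keep_alive value are placeholders):
#   1. POST /my-index/_pit?keep_alive=1m   -> response contains the PIT "id"
#   2. POST /_search                       -> body built below, no index in the path
#   3. DELETE /_pit                        -> body {"id": "<pit id>"} when done


def pit_search_body(pit_id: str, sort: List[Dict[str, str]],
                    search_after: Optional[List[Any]] = None,
                    size: int = 1000,
                    keep_alive: str = "1m") -> Dict[str, Any]:
    """Build one page request against an open point in time."""
    body: Dict[str, Any] = {
        "size": size,
        # The PIT pins the view of the data, so the index is not named again.
        "pit": {"id": pit_id, "keep_alive": keep_alive},
        "sort": sort,
    }
    if search_after is not None:
        # Sort values of the last hit from the previous page.
        body["search_after"] = search_after
    return body
```

Combining a PIT with search_after gives you the lightweight cursor while every page is read from the same view of the data, so documents indexed in the meantime do not shift the pages.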