
How to handle pagination when the source data changes frequently

Specifically, I'm using Elasticsearch to do pagination, but this question could apply to any database.

Elasticsearch provides methods to paginate search results with handy from and size parameters.

So I run a query: "get me the most recent data, results 1 through 10".

This works great.

The user clicks "next page" and the query becomes: "get me the most recent data, results 11 through 20".
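For concreteness, the two requests look roughly like this (a sketch in console syntax; the articles index and created_at field are just stand-ins for my real mapping):

```
# Page 1: newest first, results 1-10
POST /articles/_search
{
  "from": 0,
  "size": 10,
  "sort": [{ "created_at": "desc" }],
  "query": { "match_all": {} }
}

# Page 2: results 11-20, computed against whatever the index contains *now*
POST /articles/_search
{
  "from": 10,
  "size": 10,
  "sort": [{ "created_at": "desc" }],
  "query": { "match_all": {} }
}
```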

The problem is that in the time between the two queries, 2 new records were added to the backing database, which means the paginated results overlap: the last 2 results from the first page show up as the first 2 on the second page.

What's the best solution to avoid this? Right now, I'm adding a filter to the query that tells it to only include results that come after the last result of the previous page in the sort order. But it just seems hackish.

asked Jan 15 '15 by bradvido

People also ask

Why shouldn't you use offset and limit for pagination?

Because the offset counts rows from the start of the result set for every page, it can undercount when rows are deleted or overcount when new rows are inserted. Paging by offset will produce duplicate or missing results whenever the underlying data changes.


2 Answers

A filter is not a bad option if you're already indexing a relevant timestamp. You have to track that timestamp on the client side in order to correctly prepare your queries, and you have to know when to get rid of it. But those aren't insurmountable problems.
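For example, a filtered page request might look like this (a sketch assuming a created_at date field; the index name and the cutoff value are placeholders):

```
# Fetch the next page: only documents older than the last hit
# of the previous page, since we sort newest-first
POST /articles/_search
{
  "size": 10,
  "sort": [{ "created_at": "desc" }],
  "query": {
    "bool": {
      "filter": {
        "range": { "created_at": { "lt": "2015-01-15T17:01:00Z" } }
      }
    }
  }
}
```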

The Scroll API is a solid option for this, because it effectively takes a point-in-time snapshot on the Elasticsearch side. The intent of the Scroll API is to provide a stable view of a search for deep pagination, which has to deal with exactly the kind of change you're experiencing.

You begin a scrolling search by supplying your query and the scroll parameter, for which Elasticsearch returns a scroll_id. You then make requests to /_search/scroll supplying that ID, each of which returns a page of results and a new scroll_id for the next request.
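A minimal sketch of that flow (index and field names are placeholders; the scroll_id shown is truncated):

```
# Open a scrolling search and keep its context alive for one minute
POST /articles/_search?scroll=1m
{
  "size": 10,
  "sort": [{ "created_at": "desc" }],
  "query": { "match_all": {} }
}

# Each response includes a _scroll_id; pass it back to get the next page
POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gB..."
}
```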

(Note that you don't want the scan search type here. That's used to extract documents en masse, and does not apply any sorting.)

Compared to filtering, you do still have to track a value: the scroll_id for your next page of results. Whether that's easier than tracking a timestamp depends on your app.

There are other potential downsides to consider. Elasticsearch persists the context for your search on a single node within the cluster. Conceivably, these contexts could accumulate in your cluster, depending on how heavily you rely on scrolling searches, so you'll want to test the performance implications. And if I recall correctly, scroll contexts also do not survive a node failure or restart.
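If the accumulation worries you, recent versions let you release a context as soon as you're finished with it rather than waiting for the timeout. A sketch using the Clear Scroll API (the ID is a truncated placeholder):

```
# Free the search context explicitly instead of letting it expire
DELETE /_search/scroll
{
  "scroll_id": ["DXF1ZXJ5QW5kRmV0Y2gB..."]
}
```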

The ES documentation for the Scroll API provides good details on all of the above.

Bottom line: filtering by timestamp is actually not a bad choice. The Scroll API is another valid option, designed for a similar use case, but is not without its drawbacks.

answered Nov 13 '22 by Nick Zadrozny


I realise this is a bit old, but as of Elasticsearch 6.3 there's the search_after parameter for the request body, which allows cursor-style paging:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html

It is very similar to the Scroll API, but unlike scroll, search_after is stateless: it is always resolved against the latest version of the searcher.
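A sketch of what that looks like (index and field names are placeholders; the second sort field is a unique tiebreaker so the sort order is deterministic):

```
# First page: sort must include a unique tiebreaker field
POST /articles/_search
{
  "size": 10,
  "sort": [{ "created_at": "desc" }, { "id": "asc" }],
  "query": { "match_all": {} }
}

# Next page: pass the sort values of the last hit from the previous response
POST /articles/_search
{
  "size": 10,
  "sort": [{ "created_at": "desc" }, { "id": "asc" }],
  "search_after": [1421341260000, "article-4821"],
  "query": { "match_all": {} }
}
```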

answered Nov 13 '22 by heylookalive