While looking into pagination with Solr and ElasticSearch, it turned out that both have the same "problem" (deep pagination, especially with shards). Both search engines provide a solution/workaround for that, though:
Solr: cursor
https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
ElasticSearch: scroll
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-search-context
I have read those pages and searched the internet, but I'm still a bit clueless on some points:
cursor / scroll timeouts (garbage collection):
[...] cursor token). That's basically just a question about possible memory leaks, etc.
[...] scroll=1m.
Backwards pagination:
[...] cursor token for each request, so it is possible to access any previous page.
[...] scroll token. So I cannot go backwards without doing a new search?
Alter search query:
[...] scroll queries (http://localhost:9200/_search/scroll?scroll=1m&scroll_id=...). So there's no possibility to alter the search query.
[...] cursor token to the normal query. Does this mean that I can use some cursor token and change the query (filters, ordering, page size, etc.)?
Index changes while using scroll / cursor:
The Solr documentation says that if the sort value of document 1 changes so that it lands after the cursor position, the document is returned to the client twice. That's clear to me. But there are two more questions that aren't covered:
[...] cursor token for page 2 (where document 1 was before the sort value change)? Will I see the old items (including document 1), or will I see a newly generated page with freshly calculated documents?
[...] cursor token, will I be able to retrieve document 17? Or is it gone forever when using the current cursor token sequence?
The ElasticSearch documentation says nothing about what happens if the index changes while using scroll. I could imagine that it behaves the same as Solr, because both use Lucene for that functionality. But I'm completely unsure, because there's no information about that scenario.
How can this be faster than simple size=10&from=10 / rows=5&start=0?
This is more of a technical question, just because I'd like to understand what happens under the hood.
[...] cursor thing more efficient than normal pagination using start and rows. Reason: (as said above) if a document changes, it gets reindexed and can be placed after/before the current cursor. That sounds to me like it has to reorder all documents. And that's basically the same as the default pagination!?
It would be cool if someone could give me some explanation of how things work.
Thanks in advance! :)
The scroll parameter indicates how long Elasticsearch should retain the search context for the request. The search response returns a scroll ID in the _scroll_id response body parameter. You can then use the scroll ID with the scroll API to retrieve the next batch of results for the request.
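To make the request sequence concrete, here is a minimal sketch that only builds the three HTTP requests involved in a scroll (open, continue, clear) without sending them. The helper names, the index name and the default batch size are assumptions for illustration; the paths and body fields follow the scroll API as described above.

```python
import json

def start_scroll(index, query, keep_alive="1m", size=100):
    """Initial search: asks Elasticsearch to keep the search context for `keep_alive`."""
    body = {"size": size, "query": query}
    return ("POST", f"/{index}/_search?scroll={keep_alive}", json.dumps(body))

def continue_scroll(scroll_id, keep_alive="1m"):
    """Follow-up request: only the scroll ID is sent, not the original query."""
    body = {"scroll": keep_alive, "scroll_id": scroll_id}
    return ("POST", "/_search/scroll", json.dumps(body))

def clear_scroll(scroll_id):
    """Release the search context early instead of waiting for it to expire."""
    return ("DELETE", "/_search/scroll", json.dumps({"scroll_id": scroll_id}))
```

The value passed to continue_scroll would be taken from the _scroll_id field of the previous response.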
If a search request results in more than ten hits, ElasticSearch will, by default, only return the first ten hits. To override that default value in order to retrieve more or fewer hits, we can add a size parameter to the search request body.
Solr's cursor and start both function like open-ended range queries, with cursor operating like a less-than range query on score and start operating like a greater-than range query on rank. cursor is faster (especially for deep pagination) because, for a page size of 10, it only needs to hold in memory and sort at most the top 10 results, whereas start=N must hold in memory and sort the top N + 10 results, where N increases by 10 for each subsequent page. Both are sensitive to index modifications during pagination because each query runs against the current state of the index.
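The difference in work can be illustrated with a small in-memory model. This is only a sketch of the idea, not Solr's actual implementation: the documents, the price field and both function names are made up, and the sort is ascending on (price, id) with id as a unique tie-breaker so that every document has a distinct position.

```python
import heapq

def sort_key(doc):
    # Unique sort key: the sort field plus the document ID as tie-breaker.
    return (doc["price"], doc["id"])

def page_by_start(docs, start, rows):
    """Offset paging: collect the top start+rows documents, then drop the first `start`."""
    top = heapq.nsmallest(start + rows, docs, key=sort_key)
    return top[start:], len(top)  # the page, and how many entries were held

def page_by_cursor(docs, cursor, rows):
    """Cursor paging: ignore everything at or before the cursor, keep only `rows` entries."""
    candidates = (d for d in docs if cursor is None or sort_key(d) > cursor)
    page = heapq.nsmallest(rows, candidates, key=sort_key)
    next_cursor = sort_key(page[-1]) if page else cursor
    return page, next_cursor, len(page)

# Hypothetical data set: 1000 documents with pseudo-random prices.
docs = [{"id": i, "price": (i * 37) % 101} for i in range(1000)]
```

Both functions scan the whole collection and read its current state on every call, which is why neither is isolated from index changes; the cursor variant simply never has to keep more than one page of candidates.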
Elasticsearch's scroll functions like a single-use forward-only linear scan through a snapshot of the results of a fixed query which is guaranteed to return each document exactly once. It is not affected by index modifications because Elasticsearch remembers all the documents associated with the index at the time the "scroll context" was created by preserving the containing immutable segment files while the scroll context is alive. To avoid accumulating a stockpile of old segment files referred to by scroll contexts that will never be used again (perhaps because the client crashed), scroll contexts expire after a specified duration of time. My guess is that Elasticsearch supports neither jumping to arbitrary pages nor altering the query in order to optimize for scrolling efficiency.
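As a rough mental model of that behaviour, the toy class below freezes the hits when it is created, only moves forward, and refuses to answer once its keep-alive has passed. The class name, the injected clock and the list-based "index" are assumptions for illustration; real scroll contexts hold on to segment files rather than copying hits.

```python
import time

class ScrollContext:
    """Toy model of a scroll: a frozen, forward-only view of the hits that can expire."""

    def __init__(self, index, keep_alive, clock=time.monotonic):
        self._hits = list(index)   # snapshot: later changes to `index` are not seen
        self._pos = 0
        self._keep_alive = keep_alive
        self._clock = clock
        self._expires = clock() + keep_alive

    def next_batch(self, size):
        if self._clock() > self._expires:
            raise LookupError("scroll context expired")
        batch = self._hits[self._pos:self._pos + size]
        self._pos += len(batch)    # forward only: there is no way to move back
        self._expires = self._clock() + self._keep_alive  # each request renews the timeout
        return batch
```

An empty batch signals that the snapshot has been fully consumed; a client that stops calling next_batch simply lets the context run into its timeout.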
You can partially emulate the behavior of Solr's cursor in Elasticsearch using an open-ended range query in which the upper/lower bound is set to the last value of the previous batch of results.
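A minimal sketch of such a request body, assuming an ascending sort on a single field: the function name and the field passed in are hypothetical, and the body uses only the standard bool, range and sort clauses. The emulation is only exact if the sort field is unique per document; otherwise documents sharing the last value of a page would be skipped.

```python
def next_page_body(base_query, sort_field, last_value=None, size=10):
    """Build a search body returning the `size` hits that sort after `last_value`."""
    bool_query = {"must": [base_query]}
    if last_value is not None:
        # Open-ended range: only documents strictly after the previous page's last hit.
        bool_query["filter"] = [{"range": {sort_field: {"gt": last_value}}}]
    return {
        "size": size,
        "sort": [{sort_field: "asc"}],
        "query": {"bool": bool_query},
    }
```

For the first page last_value is left as None; for each following page it would be the sort field's value of the last hit in the previous response.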