Difference(s) between Solr's Cursor and ElasticSearch's Scroll

While looking into pagination with Solr and ElasticSearch, I found that both have the same "problem": deep pagination, especially across shards. Both search engines provide a solution or workaround for it:

Solr: cursor https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
ElasticSearch: scroll http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-search-context

I have read those pages and searched the internet, but I'm still unclear on a few points:

cursor / scroll timeouts (garbage collection):
1. Solr's documentation doesn't seem to provide a way to set a timeout (or a special query to invalidate a cursor token). This is basically a question about possible memory leaks, etc.
2. ElasticSearch provides a timeout setting via scroll=1m.
backwards pagination:
1. Solr will provide a cursor token for each request, so it is possible to access any previous page.
2. ElasticSearch always seems to use the same scroll token. So I cannot go backwards without doing a new search?
Alter search query:
1. ElasticSearch explicitly requires a special URL for scroll queries (http://localhost:9200/_search/scroll?scroll=1m&scroll_id=...). So there's no way to alter the search query.
2. Solr appends the cursor token to the normal query. Does this mean that I can take a cursor token and change the query (filters, ordering, page size, etc.)?
Index changes while using scroll / cursor:
1. Solr's documentation says that if the sort value of document 1 changes so that it is after the cursor position, the document is returned to the client twice. That's clear to me. But there are two more questions that aren't covered:
  1. What happens if I use the cursor token for page 2 (where document 1 was before the sort value changed)? Will I see the old items (including document 1), or will I see a newly generated page with freshly calculated documents?
  2. Basically the same question as before: Solr's documentation says that if the sort value of document 17 changes so that it is before the cursor position, the document is "skipped" and will not be returned to the client as the cursor continues to progress. If I use an old cursor token, will I be able to retrieve document 17? Or is it gone forever when using the current cursor token sequence?
2. ElasticSearch's documentation says nothing about what happens if the index changes while using scroll. I could imagine that it behaves the same as Solr, because both use Lucene for that functionality. But I'm unsure, because there's no information about that scenario.
How can this be faster than simple size=10&from=10 / rows=5&start=0?
This is more of a technical question, just because I'd like to understand what happens under the hood.
- I wonder how Solr in particular can do this cursor thing more efficiently than normal pagination using start and rows. Reason: as said above, if a document changes, it gets reindexed and can end up before or after the current cursor. That sounds to me like it has to reorder all documents, and that's basically the same as the default pagination!?

EDIT:

ElasticSearch documentation says "A scrolled search takes a snapshot in time — it doesn’t see any changes that are made to the index after the initial search request has been made. It does this by keeping the old datafiles around, so that it can preserve its “view” on what the index looked like at the time it started." So there's still the question: How does Solr handle this?

It would be great if someone could explain how these things work.

Thanks in advance! :)

asked Aug 03 '14 13:08

Benjamin M

1 Answer

Solr's cursor and start both function like open-ended range queries, with cursor operating like a less-than range query on score and start operating like a greater-than range query on rank. cursor is faster (especially for deep pagination) because, for a page size of 10, it only needs to hold in memory and sort at most the top 10 results, whereas start=N must hold in memory and sort the top N + 10 results, where N increases by 10 for each subsequent page. Both are sensitive to index modifications during pagination because each query runs against the current state of the index.
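The difference in working-set size can be illustrated with a toy model. This is only a sketch, not Solr's implementation: the in-memory `INDEX`, the `sort_key` helper, and both paging functions are made up for illustration, and the sort is assumed to be score descending with the document id as a tiebreaker.

```python
import heapq

# Toy "index" of (score, doc_id) pairs. This is NOT how Solr stores
# documents; it only models the ranking step.
INDEX = [(100 - i, f"doc{i:03d}") for i in range(50)]


def sort_key(entry):
    """Rank by score descending, then doc_id ascending as a tiebreaker."""
    score, doc_id = entry
    return (-score, doc_id)


def page_with_start(index, start, rows):
    """Offset paging: collect the top start + rows entries, then drop the first start."""
    top = heapq.nsmallest(start + rows, index, key=sort_key)
    return top[start:], len(top)


def page_with_cursor(index, cursor, rows):
    """Cursor paging: skip everything at or before the cursor, keep only the top rows."""
    candidates = (e for e in index if cursor is None or sort_key(e) > cursor)
    top = heapq.nsmallest(rows, candidates, key=sort_key)
    next_cursor = sort_key(top[-1]) if top else cursor
    return top, next_cursor, len(top)


if __name__ == "__main__":
    cursor = None
    for page in range(5):
        by_start, held_start = page_with_start(INDEX, page * 10, 10)
        by_cursor, cursor, held_cursor = page_with_cursor(INDEX, cursor, 10)
        # On an unchanged index both strategies return the same page.
        print(page, by_start == by_cursor, held_start, held_cursor)
```

In this model the offset variant ranks a growing number of entries per page, while the cursor variant never ranks more than one page's worth. Both functions read the live `INDEX` on every call, which is why a change between calls can shift, repeat, or skip documents.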

Elasticsearch's scroll functions like a single-use forward-only linear scan through a snapshot of the results of a fixed query which is guaranteed to return each document exactly once. It is not affected by index modifications because Elasticsearch remembers all the documents associated with the index at the time the "scroll context" was created by preserving the containing immutable segment files while the scroll context is alive. To avoid accumulating a stockpile of old segment files referred to by scroll contexts that will never be used again (perhaps because the client crashed), scroll contexts expire after a specified duration of time. My guess is that Elasticsearch supports neither jumping to arbitrary pages nor altering the query in order to optimize for scrolling efficiency.
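The snapshot, forward-only, and expiry properties described above can be modeled in a few lines. This is a toy sketch under stated assumptions, not Elasticsearch's code: `ScrollContext` and its parameters are hypothetical names, and copying a list stands in for keeping the old immutable segment files around.

```python
import time


class ScrollContext:
    """Toy model of a scroll: a frozen view, a forward-only position, a keep-alive."""

    def __init__(self, index, keep_alive_s, now=time.monotonic):
        # Copying the list stands in for preserving the old segment files.
        self._snapshot = sorted(index)
        self._pos = 0
        self._keep_alive_s = keep_alive_s
        self._now = now
        self._expires_at = now() + keep_alive_s

    def next_batch(self, size):
        """Return the next batch from the snapshot and renew the keep-alive."""
        if self._now() > self._expires_at:
            raise TimeoutError("scroll context expired")
        batch = self._snapshot[self._pos:self._pos + size]
        self._pos += len(batch)  # there is no way to move backwards
        self._expires_at = self._now() + self._keep_alive_s
        return batch


if __name__ == "__main__":
    index = ["a", "b", "c"]
    ctx = ScrollContext(index, keep_alive_s=60)
    index.append("d")         # a later write is invisible to the scroll
    print(ctx.next_batch(2))  # first two documents of the snapshot
    print(ctx.next_batch(2))  # the remaining document
    print(ctx.next_batch(2))  # empty batch: the scroll is exhausted
```

Because the context only stores a position into a frozen view, each document is returned exactly once, and going backwards or changing the query would require starting a new context.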

You can partially emulate the behavior of Solr's cursor in Elasticsearch using an open-ended range query in which the upper/lower bound is set to the last value of the previous batch of results.
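A minimal sketch of that emulation, building only the JSON search body and leaving the HTTP call out. The helper name `next_batch_body` and the field `published_at` are hypothetical. The sketch also assumes the sort field is unique per document: with a strict `gt` bound, documents that share the boundary value with the last hit would be skipped.

```python
import json


def next_batch_body(base_query, sort_field, last_value, size):
    """Build a search body that resumes after the last sort value seen so far."""
    filters = []
    if last_value is not None:
        # Open-ended range: only documents strictly after the previous batch.
        filters.append({"range": {sort_field: {"gt": last_value}}})
    return {
        "size": size,
        "sort": [{sort_field: "asc"}],
        "query": {"bool": {"must": [base_query], "filter": filters}},
    }


if __name__ == "__main__":
    query = {"match": {"title": "pagination"}}
    # First batch: no lower bound yet.
    print(json.dumps(next_batch_body(query, "published_at", None, 10)))
    # Later batches: pass the sort value of the last hit of the previous batch.
    print(json.dumps(next_batch_body(query, "published_at", 1407071280000, 10)))
```

Unlike scroll, every such request runs against the current state of the index, so it has the same sensitivity to concurrent index changes as Solr's cursor.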

answered Oct 13 '22 01:10

Chris Wendt

Tags: pagination, solr, lucene, elasticsearch
