Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Difference(s) between Solr's Cursor and ElasticSearch's Scroll

While looking for pagination with Solr and ElasticSearch, it turned out, both have the same "problem" (deep pagination, especially with shards). Though both search engines provide a solution/workaround for that:

  • Solr: cursor https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results

  • ElasticSearch: scroll http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-search-context

Now I read those pages and searched the internet, but I'm still a bit clueless at some points:

  • cursor / scroll timeouts (garbage collection):

    1. Solr documentations doesn't seem to provide a way for setting a timeout (or some special query to invalidate a cursor token). That's basically just a question about possible memory leaks, etc.
    2. ElasticSearch provides a timeout setting via scroll=1m.

  • backwards pagination:

    1. Solr will provide a cursor token for each request, so it is possible to access any previous page.
    2. ElasticSearch seems to use always the same scroll token. So I cannot go backwards without doing a new search?

  • Alter search query:

    1. ElasticSearch explicitly requires to use a special URL for scroll queries ( http://localhost:9200/_search/scroll?scroll=1m?scroll_id=...). So there's no possibility to alter the search query.
    2. Solr appends the cursor token to the normal query. Does this mean, that I can use some cursor token and change the query (filters, ordering, page size, etc.)?

  • Index changes while using scroll / cursor:

    1. Solr documentation says, that if the sort value of document 1 changed so that it is after the cursor position, the document is returned to the client twice. That's clear to me. But now there are two more questions, which don't get covered:

      1. What happens if I use the cursor token for page 2 (where document 1 was before the sort value change)? Will I see the old items (including document 1) or will I see a new generated page with freshly calculated documents?
      2. Basically the same question as before: Solr documentation says: the sort value of document 17 changed so that it is before the cursor position, the document has been "skipped" and will not be returned to the client as the cursor continues to progress. If I use an old cursor token, will I be able to retrieve document 17? Or is it gone forever when using the current cursor token sequence?

    2. ElasticSearch documentation says nothing about what happens if the index changes while using scroll. I could imagine that it behaves the same as Solr, because both use Lucene for that functionality. But I'm completely unsure, because there's no information about that scenario.

  • How can this be faster than simple size=10&from=10 / rows=5&start=0?
    More kinda technical question, just because I'd like to understand what happens under the hood.

    • I just wondered how (especially) Solr can do this cursor thing more efficient than normal pagination using start and rows. Reason: (as said above) If a document changes, it will get reindex and can be placed after/before the current cursor. That sounds to me, like it has to reorder all documents. And that's basically the same as the default pagination!?

EDIT:

  • ElasticSearch documentation says "A scrolled search takes a snapshot in time — it doesn’t see any changes that are made to the index after the initial search request has been made. It does this by keeping the old datafiles around, so that it can preserve its “view” on what the index looked like at the time it started." So there's still the question: How does Solr handle this?

Would be cool, if someone could give me some explanation how things work.

Thanks in advance! :)

like image 708
Benjamin M Avatar asked Aug 03 '14 13:08

Benjamin M


People also ask

What is the main architectural difference between Elasticsearch and SOLR?

1 Ingest and Query services. The Elasticsearch query process is structured very similarly to the Solr service. The main difference lies in the microservice architecture of the system, and the exits to the Elasticsearch and the ZooKeeper administrative functions, rather than to Solr and the monolithic search server.

What is elastic scrolling?

The scroll parameter indicates how long Elasticsearch should retain the search context for the request. The search response returns a scroll ID in the _scroll_id response body parameter. You can then use the scroll ID with the scroll API to retrieve the next batch of results for the request.

How pagination works in Elasticsearch?

If a search request results in more than ten hits, ElasticSearch will, by default, only return the first ten hits. To override that default value in order to retrieve more or fewer hits, we can add a size parameter to the search request body.

What algorithm does Elasticsearch use?

The default scoring algorithm used by Elasticsearch is BM25. There are three main factors that determine a document's score: Term frequency (TF) — The more times that a search term appears in the field we are searching in a document, the more relevant that document is.


1 Answers

Solr's cursor and start both function like open-ended range queries, with cursor operating like a less-than range query on score and start operating like a greater-than range query on rank. cursor is faster (especially for deep pagination) because, for a page size of 10, it only needs to hold in memory and sort at most the top 10 results, whereas start=N must hold in memory and sort the top N + 10 results, where N increases by 10 for each subsequent page. Both are sensitive to index modifications during pagination because each query runs against the current state of the index.

Elasticsearch's scroll functions like a single-use forward-only linear scan through a snapshot of the results of a fixed query which is guaranteed to return each document exactly once. It is not affected by index modifications because Elasticsearch remembers all the documents associated with the index at the time the "scroll context" was created by preserving the containing immutable segment files while the scroll context is alive. To avoid accumulating a stockpile of old segment files referred to by scroll contexts that will never be used again (perhaps because the client crashed), scroll contexts expire after a specified duration of time. My guess is that Elasticsearch supports neither jumping to arbitrary pages nor altering the query in order to optimize for scrolling efficiency.

You can partially emulate the behavior of Solr's cursor in Elasticsearch using an open-ended range query in which the upper/lower bound is set to the last value of the previous batch of results.

like image 140
Chris Wendt Avatar answered Oct 13 '22 01:10

Chris Wendt