How does search_after work in Elasticsearch?

I have been trying to use Elasticsearch for our application, but pagination being limited to 10k results is an issue for us, and the scroll API is also not a recommended choice because of its timeout issue.

I found out Elasticsearch has something called search_after, which is the ideal solution for supporting deep pagination. I have been trying to understand it from the docs, but it's a bit confusing and I was not able to clearly understand how it works.

Let's assume I have three fields in my document: id, first_name, and last_name. Here id is a unique primary key.

{
    "size": 10,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "sort": [
        {"id": "asc"}      
    ]
}

Can I use the above query with search_after? I read in their docs that we have to use multiple unique values in the sort rather than just one (ID), but in my dataset only ID is unique. What can I do to use search_after with my example dataset?

I was not able to understand the issue they describe when only one unique tie-breaker is used for the sort. Can someone explain this in layman's terms?

https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-search-after.html

A field with one unique value per document should be used as the tiebreaker of the sort specification. Otherwise the sort order for documents that have the same sort values would be undefined and could lead to missing or duplicate results. The _id field has a unique value per document but it is not recommended to use it as a tiebreaker directly. Beware that search_after looks for the first document which fully or partially matches tiebreaker’s provided value. Therefore if a document has a tiebreaker value of "654323" and you search_after for "654" it would still match that document and return results found after it. doc value are disabled on this field so sorting on it requires to load a lot of data in memory. Instead it is advised to duplicate (client side or with a set ingest processor) the content of the _id field in another field that has doc value enabled and to use this new field as the tiebreaker for the sort.

user_12 asked Jun 25 '21 08:06


1 Answer

In your case, if your id field contains unique values and has the type keyword (or numeric) then you're absolutely fine and can use it to paginate using search_after.

So the first call would be the one you have in your question:

{
    "size": 10,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "sort": [
        {"id": "asc"}
    ]
}

In your response, you need to look at the last hit and take the sort value from that last hit:

{
    "_index" : "myindex",
    "_type" : "_doc",
    "_id" : "100000012",
    "_score" : null,
    "_source": { ... },
    "sort" : [
      "100000012"                                 <--- take this
    ]
}

Then in your next search call, you'll specify that value in search_after (it must contain one value per entry in sort, in the same order):

{
    "size": 10,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "search_after": [ "100000012" ],              <--- add this
    "sort": [
        {"id": "asc"}
    ]
}

And the first hit of the next result set will be the next matching document after that one, e.g. id: 100000013. That's it. There's nothing more to it.
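
If it helps, the steps above can be sketched as a loop in Python. This is only a minimal sketch, not the official client API: `search` is a hypothetical callable that takes a request body and returns the parsed JSON response (you would wrap whatever client you use), and `build_body` and `iterate_hits` are names made up for this illustration.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional


def build_body(search_after: Optional[List[Any]] = None, size: int = 10) -> Dict[str, Any]:
    """Build the request body from the question, optionally with search_after."""
    body: Dict[str, Any] = {
        "size": size,
        "query": {"match": {"title": "elasticsearch"}},
        "sort": [{"id": "asc"}],  # id is unique, so it is its own tiebreaker
    }
    if search_after is not None:
        # One value per sort entry, copied from the last hit of the previous page.
        body["search_after"] = search_after
    return body


def iterate_hits(
    search: Callable[[Dict[str, Any]], Dict[str, Any]], size: int = 10
) -> Iterator[Dict[str, Any]]:
    """Yield every hit, page by page, until an empty page comes back."""
    search_after: Optional[List[Any]] = None
    while True:
        response = search(build_body(search_after, size))
        hits = response["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # The "sort" array of the last hit is the cursor for the next page.
        search_after = hits[-1]["sort"]
```

The loop stops at the first empty page, so you don't need to know the total number of hits in advance.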

The problem you're pointing at does not concern you if you always sort with full id values. The way it works is that you always use the last id value from the previous results. If you were to add "search_after": ["1000"] then you'd have the issue they mention, but there's no reason for you to do it.
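
As for the layman's explanation of why the docs ask for a unique tiebreaker, here is a toy in-memory model. It assumes search_after simply means "give me the documents whose sort values come strictly after these", which is a simplification and not how Elasticsearch is implemented; `DOCS`, `page` and `collect_ids` are names made up for the example. Sorting on a non-unique field alone loses documents when a page boundary falls between ties, while adding the unique id as the last sort field (or sorting on the unique id alone, as in your case) does not.

```python
from typing import Any, Dict, List, Optional

# Toy data: last_name is NOT unique, id is.
DOCS: List[Dict[str, Any]] = [
    {"id": 1, "last_name": "Adams"},
    {"id": 2, "last_name": "Smith"},
    {"id": 3, "last_name": "Smith"},
    {"id": 4, "last_name": "Smith"},
    {"id": 5, "last_name": "Young"},
]


def page(sort_fields: List[str], search_after: Optional[List[Any]] = None,
         size: int = 2) -> List[Dict[str, Any]]:
    """Return one page: docs sorted by sort_fields, strictly after the cursor."""
    def key(doc: Dict[str, Any]) -> tuple:
        return tuple(doc[field] for field in sort_fields)

    ordered = sorted(DOCS, key=key)
    if search_after is not None:
        ordered = [doc for doc in ordered if key(doc) > tuple(search_after)]
    return ordered[:size]


def collect_ids(sort_fields: List[str], size: int = 2) -> List[int]:
    """Page through everything and return the ids that were actually seen."""
    seen: List[int] = []
    cursor: Optional[List[Any]] = None
    while True:
        hits = page(sort_fields, cursor, size)
        if not hits:
            return seen
        seen.extend(doc["id"] for doc in hits)
        # Cursor = sort values of the last hit on this page.
        cursor = [hits[-1][field] for field in sort_fields]


print(collect_ids(["last_name"]))        # [1, 2, 5]  (ids 3 and 4 are skipped)
print(collect_ids(["last_name", "id"]))  # [1, 2, 3, 4, 5]
```

In the first call the first page ends on a "Smith", so asking for everything after "Smith" jumps straight to "Young" and the other two Smiths are never returned. With the unique id as the last sort value, the cursor identifies exactly one position in the order, so nothing is skipped.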

Val answered Oct 11 '22 22:10