Filter Elasticsearch results to contain only unique documents based on one field value

All my documents have a uid field with an ID that links the document to a user. There are multiple documents with the same uid.

I want to perform a search over all the documents returning only the highest scoring document per unique uid.

The query selecting the relevant documents is a simple multi_match query.

TheHippo asked Oct 22 '14 13:10




2 Answers

You need a top_hits aggregation.

And for your specific case:

{
  "query": {
    "multi_match": {
      ...
    }
  },
  "aggs": {
    "top-uids": {
      "terms": {
        "field": "uid"
      },
      "aggs": {
        "top_uids_hits": {
          "top_hits": {
            "sort": [
              {
                "_score": {
                  "order": "desc"
                }
              }
            ],
            "size": 1
          }
        }
      }
    }
  }
}

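The query above runs your multi_match query and groups the matching documents into buckets by uid. Within each uid bucket the documents are sorted by _score in descending order, and only the top one is returned. Note that a terms aggregation returns only 10 buckets by default, ordered by document count rather than by score; set its size parameter if you need more uids.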
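To make the shape of the result concrete, here is a minimal Python sketch of reading the best hit per uid out of the aggregation part of the response. It assumes the response has already been parsed into a dict and that the aggregation names match the query above ("top-uids" and "top_uids_hits"). The helper name top_hit_per_uid and the sample response are made up for illustration and are not real output.

```python
from typing import Any, Dict


def top_hit_per_uid(
    response: Dict[str, Any],
    terms_agg: str = "top-uids",
    hits_agg: str = "top_uids_hits",
) -> Dict[Any, Dict[str, Any]]:
    """Map each uid bucket key to the single hit returned by top_hits."""
    best: Dict[Any, Dict[str, Any]] = {}
    buckets = response.get("aggregations", {}).get(terms_agg, {}).get("buckets", [])
    for bucket in buckets:
        # top_hits nests its results under <agg name> -> hits -> hits
        hits = bucket[hits_agg]["hits"]["hits"]
        if hits:
            best[bucket["key"]] = hits[0]
    return best


# Hypothetical, hand-written response fragment (not real Elasticsearch output).
sample_response = {
    "aggregations": {
        "top-uids": {
            "buckets": [
                {
                    "key": "u1",
                    "doc_count": 3,
                    "top_uids_hits": {
                        "hits": {"hits": [{"_id": "a", "_score": 2.5, "_source": {"uid": "u1"}}]}
                    },
                },
                {
                    "key": "u2",
                    "doc_count": 1,
                    "top_uids_hits": {
                        "hits": {"hits": [{"_id": "b", "_score": 1.2, "_source": {"uid": "u2"}}]}
                    },
                },
            ]
        }
    }
}

for uid, hit in top_hit_per_uid(sample_response).items():
    print(uid, hit["_id"], hit["_score"])
```

The top-level hits of the response still contain the ungrouped documents, so if you only need the per-uid results you can set the top-level "size" to 0 and read everything from the aggregation.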

Andrei Stefan answered Sep 29 '22 03:09


Elasticsearch 5.3 added support for field collapsing. You should be able to do something like:

GET /_search
{
  "query": {
    "multi_match" : {
      "query":    "this is a test", 
      "fields": [ "subject", "message", "uid" ] 
    }
  },
  "collapse" : {
    "field" : "uid" 
  },
  "size": 20,
  "from": 100
}

The benefit of field collapsing over a top_hits aggregation is that it supports pagination: from and size apply to the collapsed results, so each page contains at most one document per uid.
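As a rough illustration of the pagination point, the following Python sketch builds the request body for a given page of collapsed results. The function name collapsed_page and the zero-based page argument are assumptions made for this example; only the "query", "collapse", "size" and "from" keys come from the request shown above.

```python
import json
from typing import Any, Dict


def collapsed_page(
    query: Dict[str, Any],
    collapse_field: str,
    page: int,
    page_size: int = 20,
) -> Dict[str, Any]:
    """Build a search body that collapses on one field and selects one page.

    `page` is zero-based (an assumption of this sketch), so page 5 with
    page_size 20 gives from=100, matching the request above.
    """
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    return {
        "query": query,
        "collapse": {"field": collapse_field},
        "size": page_size,
        "from": page * page_size,
    }


# Same multi_match query as in the request above.
query = {
    "multi_match": {
        "query": "this is a test",
        "fields": ["subject", "message", "uid"],
    }
}

body = collapsed_page(query, "uid", page=5)
print(json.dumps(body, indent=2))
```

The resulting dict can be sent as the JSON body of the GET /_search request with whatever HTTP or Elasticsearch client you already use.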

Chase answered Sep 29 '22 03:09