Dedup Elasticsearch results using multiple fields as unique key

Similar questions have been asked before (see "Remove duplicate documents from a search in Elasticsearch"), but I haven't found a way to dedup using multiple fields as the "unique key". Here's a simple example to illustrate what I'm looking for:

Say this is our raw data:

{ "name": "X", "event": "A", "time": 1 }
{ "name": "X", "event": "B", "time": 2 }
{ "name": "X", "event": "B", "time": 3 }
{ "name": "Y", "event": "A", "time": 4 }
{ "name": "Y", "event": "C", "time": 5 }

I'd like to get distinct event counts, where each combination of name and event is counted only once. Event B occurred twice for the same name X and should not be double counted, so the counts I'm looking for are:

event: A, count: 2
event: B, count: 1
event: C, count: 1
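
To make the intended semantics concrete, here is a minimal plain-Python sketch (no Elasticsearch involved) that collapses the sample documents to unique (name, event) pairs and then counts pairs per event. The function name `distinct_event_counts` is made up for illustration.

```python
from collections import Counter

# Sample documents from the question
docs = [
    {"name": "X", "event": "A", "time": 1},
    {"name": "X", "event": "B", "time": 2},
    {"name": "X", "event": "B", "time": 3},
    {"name": "Y", "event": "A", "time": 4},
    {"name": "Y", "event": "C", "time": 5},
]


def distinct_event_counts(docs):
    """Count events, treating each (name, event) pair as a single occurrence."""
    # A set keeps only one entry per (name, event) combination
    unique_pairs = {(d["name"], d["event"]) for d in docs}
    # Count how many distinct names each event appeared under
    return Counter(event for _, event in unique_pairs)


counts = distinct_event_counts(docs)
for event in sorted(counts):
    print(f"event: {event}, count: {counts[event]}")
```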

Is there a way to set up an agg query like the one in the related question? Another option I've considered is to index each object with a special key field (e.g. "X_A", "X_B", etc.) and then simply dedup on that field. I'm not sure which approach is preferable, but I'd rather not index the data with extra metadata.
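
For reference, the key-field option would look roughly like the sketch below, applied to each document before indexing. The field name `dedup_key`, the helper `with_dedup_key`, and the `_` separator are all assumptions for illustration, not anything Elasticsearch prescribes.

```python
def with_dedup_key(doc, fields=("name", "event"), sep="_"):
    """Return a copy of doc with a hypothetical `dedup_key` field added."""
    keyed = dict(doc)  # shallow copy so the caller's document is not mutated
    # Join the chosen field values into one composite key, e.g. "X_B"
    keyed["dedup_key"] = sep.join(str(doc[f]) for f in fields)
    return keyed


print(with_dedup_key({"name": "X", "event": "B", "time": 2}))
```

Note that a separator that can also appear inside a name or event value would make the key ambiguous.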

Shark asked Sep 21 '16 00:09


1 Answer

You can specify a script in a terms aggregation in order to build a key out of multiple fields:

POST /test/dedup/_search
{
  "aggs": {
    "dedup": {
      "terms": {
        "script": "[doc.name.value, doc.event.value].join('_')"
      },
      "aggs": {
        "dedup_docs": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}

This produces one bucket per unique name/event combination, with the following document counts:

  • X_A: 1
  • X_B: 2
  • Y_A: 1
  • Y_C: 1

Note: There's only one event C in your sample data, so the count cannot be two unless I'm missing something.
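
To get from these composite buckets to the per-event counts asked for in the question, one option is to post-process the response on the client. The sketch below assumes the usual terms-aggregation response shape (`aggregations.dedup.buckets`, each bucket with `key` and `doc_count`) and that event values never contain the `_` separator; the helper name `event_counts_from_buckets` is hypothetical.

```python
from collections import Counter

# Assumed shape of the relevant part of the search response for the query above
response = {
    "aggregations": {
        "dedup": {
            "buckets": [
                {"key": "X_A", "doc_count": 1},
                {"key": "X_B", "doc_count": 2},
                {"key": "Y_A", "doc_count": 1},
                {"key": "Y_C", "doc_count": 1},
            ]
        }
    }
}


def event_counts_from_buckets(response, agg_name="dedup", sep="_"):
    """Count one occurrence per bucket, grouped by the event part of the key."""
    buckets = response["aggregations"][agg_name]["buckets"]
    # Each bucket is one unique name/event pair, so doc_count is ignored here.
    # rsplit takes the text after the last separator as the event.
    return Counter(b["key"].rsplit(sep, 1)[1] for b in buckets)


print(dict(event_counts_from_buckets(response)))
```

Keep in mind that a terms aggregation only returns the top buckets by default (10), so the aggregation's `size` would likely need to be raised for real data.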

Val answered Oct 14 '22 17:10
