Similar questions to this have been asked (see Remove duplicate documents from a search in Elasticsearch), but I haven't found a way to dedupe using multiple fields as the "unique key". Here's a simple example to illustrate a bit of what I'm looking for:
Say this is our raw data:
{ "name": "X", "event": "A", "time": 1 }
{ "name": "X", "event": "B", "time": 2 }
{ "name": "X", "event": "B", "time": 3 }
{ "name": "Y", "event": "A", "time": 4 }
{ "name": "Y", "event": "C", "time": 5 }
I would essentially like to get the distinct event counts based on name and event. I want to avoid double-counting event B, which happened twice for the same name X, so the counts I'd be looking for are:
event: A, count: 2
event: B, count: 1
event: C, count: 1
Is there a way to set up an agg query as seen in the related question? Another option I've considered is to index each object with a special key field (e.g. "X_A", "X_B", etc.). I could then simply dedupe on that field. I'm not sure which approach is preferable, but I'd personally prefer not to index the data with extra metadata.
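For reference, that second option would mean indexing documents with a precomputed key, along these lines (name_event is just an illustrative field name):
{ "name": "X", "event": "A", "time": 1, "name_event": "X_A" }
{ "name": "X", "event": "B", "time": 2, "name_event": "X_B" }
{ "name": "X", "event": "B", "time": 3, "name_event": "X_B" }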
You can specify a script in a terms aggregation in order to build a key out of multiple fields:
POST /test/dedup/_search
{
  "aggs": {
    "dedup": {
      "terms": {
        "script": "[doc.name.value, doc.event.value].join('_')"
      },
      "aggs": {
        "dedup_docs": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
Given your sample data, this will basically produce one bucket per distinct name/event pair, with a single representative document in each.
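The response would look roughly like this (abbreviated, with the top_hits contents elided):
{
  "aggregations": {
    "dedup": {
      "buckets": [
        { "key": "X_B", "doc_count": 2, "dedup_docs": { ... } },
        { "key": "X_A", "doc_count": 1, "dedup_docs": { ... } },
        { "key": "Y_A", "doc_count": 1, "dedup_docs": { ... } },
        { "key": "Y_C", "doc_count": 1, "dedup_docs": { ... } }
      ]
    }
  }
}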
Note: There's only one event C in your sample data, so the count cannot be two unless I'm missing something.
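Side note: the script above uses the older Groovy-style syntax. On more recent Elasticsearch versions the script would be written in Painless, roughly like this (assuming the default dynamic mapping, where string fields get a .keyword sub-field; drop the suffix if name and event are mapped as keyword directly):
POST /test/_search
{
  "size": 0,
  "aggs": {
    "dedup": {
      "terms": {
        "script": {
          "source": "doc['name.keyword'].value + '_' + doc['event.keyword'].value"
        }
      },
      "aggs": {
        "dedup_docs": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
The "size": 0 simply skips the regular search hits so that only the aggregation results come back.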