I have an ElasticSearch index with lots of documents in it. There are roughly 20 fields on the data model; of these, there are 5 that, if they are the same, would lead me to conclude that the document is a duplicate. So basically, I want to group documents that have the same values in all 5 fields, and return the documents in each resulting bucket (not just aggregated values).
Can ElasticSearch do this?
The short answer is yes, Elasticsearch can do this. The following short example shows how:
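{
  "filtered": {
    "query": {
      // Your query goes here
    },
    "filter": {
      "script": {
        "script": "doc['field1'].value == doc['field2'].value && doc['field2'].value == doc['field3'].value && doc['field3'].value == doc['field4'].value"
      }
    }
  }
}
I've only tried this with 2 fields. For more than that, the comparisons have to be made pairwise and joined with &&, because a chained a == b == c would compare the boolean result of a == b against c.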
You're basically using filters to remove documents where those fields aren't all equal to each other. Hopefully this helps.
If instead you want to match documentA against documentB and see whether 5 of their fields are the same, that is a different problem.
To solve that problem, my suggestion would be to write a script that fetches one document at a time, runs an Elasticsearch query filtering on the fields you're looking for, and checks whether any other documents turn up. If they do, remove them and repeat the process. Move on to the next document if there are no matches. When there are no more documents to check, you're done. (You might want to keep a document counter or a list of document names to keep track of when you're done.)
This is probably not the clean Elasticsearch approach you were looking for, and there might be a better way, but it is one way to solve your problem.
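The document-by-document loop described above could look roughly like the following Python sketch. It is only an illustration under some assumptions: the field names in KEY_FIELDS are placeholders, the five fields are assumed to be indexed as exact values so that term filters match them, and get_doc, search and delete are hypothetical callables that you would wrap around whatever Elasticsearch client you use.

```python
from typing import Callable, Dict, Iterable, List

# Placeholder names (assumption): replace with the five fields that define a duplicate.
KEY_FIELDS = ["field1", "field2", "field3", "field4", "field5"]


def duplicate_query(doc: Dict, key_fields: List[str] = KEY_FIELDS) -> Dict:
    """Build a query body matching documents whose key fields equal those of `doc`."""
    return {
        "query": {
            "bool": {
                # One exact-match term filter per key field; all must match.
                "filter": [{"term": {f: doc[f]}} for f in key_fields]
            }
        }
    }


def remove_duplicates(
    doc_ids: Iterable[str],
    get_doc: Callable[[str], Dict],          # hypothetical: id -> document source
    search: Callable[[Dict], Iterable[str]],  # hypothetical: query body -> matching ids
    delete: Callable[[str], None],           # hypothetical: removes a document by id
    key_fields: List[str] = KEY_FIELDS,
) -> List[str]:
    """Keep the first document of each duplicate group and delete the rest."""
    removed: List[str] = []
    removed_set = set()
    for doc_id in doc_ids:
        if doc_id in removed_set:
            continue  # already deleted as a duplicate of an earlier document
        body = duplicate_query(get_doc(doc_id), key_fields)
        for other_id in search(body):
            if other_id != doc_id and other_id not in removed_set:
                delete(other_id)
                removed.append(other_id)
                removed_set.add(other_id)
    return removed
```

The returned list doubles as the "list of document names" mentioned above, so you can log or review what was removed. Note that this issues one query per document, so it may be slow on a large index.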
Try using the following steps.
As you can see, at least 5 fields should have that value for a document to be returned. Have a look at the first example here
To get the complete documents inside a bucket, use the top hits aggregation as explained here
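As a rough illustration of the top hits idea, the sketch below builds a request body that buckets documents on the combined value of the key fields and returns the documents in each bucket. This is not the example the links above pointed to; it is one possible shape, assuming a terms aggregation keyed by a script, min_doc_count set to 2 so only groups with duplicates come back, and a top_hits sub-aggregation. The function name, the aggregation names and the '|' separator are made up for this example, and the exact script syntax depends on your Elasticsearch version.

```python
from typing import Dict, List


def duplicate_buckets_body(
    key_fields: List[str],
    docs_per_bucket: int = 10,
    max_buckets: int = 100,
) -> Dict:
    """Build a search body that groups documents sharing all key field values."""
    # Join the field values into a single bucket key, e.g.
    # doc['a'].value + '|' + doc['b'].value
    source = " + '|' + ".join("doc['%s'].value" % f for f in key_fields)
    return {
        "size": 0,  # no top-level hits, only the aggregation result
        "aggs": {
            "duplicates": {
                "terms": {
                    "script": {"source": source},
                    "min_doc_count": 2,  # only buckets holding at least two documents
                    "size": max_buckets,
                },
                "aggs": {
                    # Return the actual documents of each bucket, not just counts.
                    "documents": {"top_hits": {"size": docs_per_bucket}}
                },
            }
        },
    }
```

You would send the resulting dictionary as the body of a search request; each bucket under "duplicates" then carries its documents under "documents". A separator that cannot occur in the field values avoids accidental key collisions.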