
ElasticSearch: Finding documents with multiple identical fields

I have an ElasticSearch index with lots of documents in it. There are roughly 20 fields on the data model; of these, there are 5 that, if they are the same, would lead me to conclude that the document is a duplicate. So basically, I want to group documents that have the same values in all 5 fields, and return the documents in each resulting bucket (not just aggregated values).

Can ElasticSearch do this?

SuperNES asked Oct 27 '25 19:10



2 Answers

The short answer is yes, Elasticsearch can do this. Here is a short example:

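{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "script": {
          "script": "doc['field1'].value == doc['field2'].value && doc['field2'].value == doc['field3'].value && doc['field3'].value == doc['field4'].value"
        }
      }
    }
  }
}

Replace match_all with your own query. I've only tried this with 2 fields, but it should work for more as long as the comparisons are joined with &&; a chained a == b == c would compare the boolean result of a == b against c rather than checking that all three are equal.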

You're basically using filters to remove documents where those fields aren't all equal to each other. Hopefully this helps.
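For more than a handful of fields, the script string is easier to generate than to type. The following is a minimal sketch in Python that only builds the request body as a dict; the helper names and the field names are made up for illustration, and it assumes the same `filtered` syntax used above (which was deprecated in Elasticsearch 2.0 and removed in 5.0, so newer clusters would need a `bool` query with a `filter` clause instead).

```python
import json


def all_equal_script(fields):
    """Build a script expression that is true only when all fields hold the same value."""
    # Compare each field with its neighbour and join with &&; equality is
    # transitive, so adjacent pairs are enough to cover every field.
    pairs = zip(fields, fields[1:])
    return " && ".join(
        "doc['%s'].value == doc['%s'].value" % (left, right) for left, right in pairs
    )


def filtered_body(fields, query=None):
    """Wrap the script in a filtered query; `query` defaults to match_all."""
    return {
        "query": {
            "filtered": {
                "query": query or {"match_all": {}},
                "filter": {"script": {"script": all_equal_script(fields)}},
            }
        }
    }


# Hypothetical field names, for illustration only.
body = filtered_body(["field1", "field2", "field3"])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a search request.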

If instead you want to compare documentA with documentB and check whether the same 5 fields hold identical values in both, that is a different problem.

To solve that problem, my suggestion would be to write a script that fetches one document at a time, runs an Elasticsearch query filtering on the values of the fields you're looking at, and checks whether any other documents turn up. If they do, remove them and repeat the process; if there are no matches, move on to the next document. When there are no more documents to check, you're done. (You might want to keep a document counter or a list of document names to keep track of when you're done.)

This is probably not the clean elasticsearch approach you were looking for and there might be a better way but this is one way to solve your problem.
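The loop described above could be sketched roughly as follows. This is an illustration under several assumptions, not a tested production script: the query uses the `bool`/`filter` syntax of Elasticsearch 2.0 and later, the key fields are assumed to be exact-value (not analyzed) fields so that `term` matches whole values, `search` is a hypothetical callable standing in for whatever client you use, and duplicates are only collected rather than deleted. The `memory_search` helper is an in-memory stand-in so the sketch runs without a cluster.

```python
def duplicate_query(doc, fields):
    """Exact-match query on every key field of `doc`."""
    return {"query": {"bool": {"filter": [{"term": {f: doc[f]}} for f in fields]}}}


def find_duplicates(docs, fields, search):
    """Return {kept_id: [duplicate_ids]}, querying once per surviving document.

    `search(body)` is a hypothetical callable that runs the query and
    returns the ids of the matching documents.
    """
    removed = set()  # ids already flagged as duplicates, so they are not re-checked
    groups = {}
    for doc in docs:
        if doc["id"] in removed:
            continue
        hits = [i for i in search(duplicate_query(doc, fields))
                if i != doc["id"] and i not in removed]
        if hits:
            groups[doc["id"]] = hits
            removed.update(hits)
    return groups


def memory_search(docs):
    """In-memory stand-in for a cluster: evaluates the term filters on a list."""
    def search(body):
        terms = [t["term"] for t in body["query"]["bool"]["filter"]]
        return [d["id"] for d in docs
                if all(d[f] == v for t in terms for f, v in t.items())]
    return search


# Made-up sample data with two key fields instead of five.
docs = [
    {"id": 1, "a": "x", "b": 1},
    {"id": 2, "a": "x", "b": 1},
    {"id": 3, "a": "y", "b": 1},
    {"id": 4, "a": "x", "b": 1},
]
print(find_duplicates(docs, ["a", "b"], memory_search(docs)))  # {1: [2, 4]}
```

Tracking the removed ids plays the role of the "list of document names" mentioned above: it keeps the loop from re-checking documents that are already known duplicates.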

Vishal Rao answered Oct 29 '25 09:10



Try the following steps.

  1. Collect all the distinct values across the fields, using a terms aggregation on each field.
  2. Query for each value using should clauses on all fields.
  3. Set the minimum_should_match parameter to 5.

This way, a document is returned only if at least 5 fields hold that value. Have a look at the first example here
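As a rough illustration of the steps above, the two request bodies might be built like this. The sketch only constructs dicts and does not talk to a cluster; the helper names, the `size` default, and the field names `field1` to `field5` are assumptions made for the example, and `term` clauses assume exact-value fields.

```python
import json


def distinct_values_body(fields, size=1000):
    """Step 1: one terms aggregation per field to list its distinct values."""
    return {
        "size": 0,  # only the aggregations are needed, not the hits
        "aggs": {f: {"terms": {"field": f, "size": size}} for f in fields},
    }


def value_in_fields_query(value, fields, minimum=5):
    """Steps 2 and 3: should clauses on every field, at least `minimum` must match."""
    return {
        "query": {
            "bool": {
                "should": [{"term": {f: value}} for f in fields],
                "minimum_should_match": minimum,
            }
        }
    }


# Hypothetical field names and value, for illustration only.
FIELDS = ["field1", "field2", "field3", "field4", "field5"]
print(json.dumps(distinct_values_body(FIELDS), indent=2))
print(json.dumps(value_in_fields_query("some value", FIELDS), indent=2))
```

The second body would be sent once for each distinct value returned by the first.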

To get the complete documents inside each bucket, use the top hits aggregation, as explained here
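One possible way to combine this with the original question is to nest one terms aggregation per key field and put a top hits aggregation at the innermost level, so each bucket holds the documents that share all of the values. The nesting and the `min_doc_count` of 2 (to keep only buckets with more than one document) are my assumptions and are not spelled out above; the helper name, aggregation names, sizes, and field names are made up for the example, and the fields are assumed to be exact-value fields.

```python
import json


def duplicate_buckets_body(fields, docs_per_bucket=10, buckets=1000):
    """Nest a terms aggregation per field, with top_hits returning the documents."""
    # Innermost aggregation: the actual documents of each bucket.
    aggs = {"docs": {"top_hits": {"size": docs_per_bucket}}}
    # Wrap from the last field outwards so the first field ends up on top.
    for f in reversed(fields):
        aggs = {
            "by_" + f: {
                "terms": {"field": f, "min_doc_count": 2, "size": buckets},
                "aggs": aggs,
            }
        }
    return {"size": 0, "aggs": aggs}


# Hypothetical field names, for illustration only.
body = duplicate_buckets_body(["field1", "field2", "field3", "field4", "field5"])
print(json.dumps(body, indent=2))
```

Deeply nested terms aggregations can be expensive on large indices, so the bucket sizes here are placeholders to tune.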

rajat answered Oct 29 '25 07:10



