Elasticsearch delete duplicates

Some of the records in my index are duplicates, identified by a numeric field recordid.

Elasticsearch has delete-by-query. Can I use it to delete one of each pair of duplicate records?

Or is there some other way to achieve this?

FUD asked Jul 19 '14


2 Answers

Yes, you can find duplicated documents with an aggregation query:

curl -XPOST http://localhost:9200/your_index/_search -d '
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "recordid",
        "min_doc_count": 2,
        "size": 10
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 10
          }
        }
      }
    }
  }
}'
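Assuming a response shaped like the aggregation above (buckets named duplicateCount, hits named duplicateDocuments), here is a small sketch of how you might collect the _ids to delete, keeping the first hit in each bucket. The function name is hypothetical, not part of any library:

```python
def ids_to_delete(agg_response):
    """From a duplicateCount/duplicateDocuments aggregation response,
    return the _id of every duplicate except the first hit per bucket."""
    ids = []
    buckets = agg_response["aggregations"]["duplicateCount"]["buckets"]
    for bucket in buckets:
        hits = bucket["duplicateDocuments"]["hits"]["hits"]
        # keep the first document, mark the rest for deletion
        for hit in hits[1:]:
            ids.append(hit["_id"])
    return ids
```

Note that the terms aggregation returns at most "size" buckets per request, so on a large index you would re-run the query until no buckets with min_doc_count >= 2 remain.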

then delete the duplicated documents, preferably with a bulk request. Have a look at es-deduplicator for automated duplicate removal (disclaimer: I'm the author of that script).
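The body of a bulk request is newline-delimited JSON, one action per line. A sketch of building a delete body from a list of ids (the index name your_index is a placeholder):

```python
import json

def bulk_delete_body(index, ids):
    """Build an NDJSON body for the Elasticsearch _bulk endpoint that
    deletes the given document ids. Each action is one JSON line and
    the body must end with a trailing newline."""
    lines = [json.dumps({"delete": {"_index": index, "_id": doc_id}})
             for doc_id in ids]
    return "\n".join(lines) + "\n"
```

The resulting body would be POSTed to http://localhost:9200/_bulk (recent Elasticsearch versions expect the Content-Type header application/x-ndjson).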

NOTE: Aggregation queries can be very expensive and may even crash your nodes (if your index is too large and the number of data nodes too small).

Tombart answered Sep 19 '22

Elasticsearch recommends using the scroll/scan API to find all matching ids and then issuing a bulk request to delete them.
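The scroll/scan approach streams every document once, so duplicates can be detected by remembering which recordid values have already been seen. A sketch of that core step, operating on hit dicts as the scroll API returns them (the function name is hypothetical):

```python
def duplicate_ids(hits):
    """Given hits streamed from the scroll/scan API (dicts with an _id
    and a _source containing recordid), yield the _id of every hit whose
    recordid was already seen, i.e. every duplicate after the first
    occurrence."""
    seen = set()
    for hit in hits:
        rid = hit["_source"]["recordid"]
        if rid in seen:
            yield hit["_id"]
        else:
            seen.add(rid)
```

The yielded ids can then be fed to a bulk delete request in batches. Unlike the aggregation approach, this needs only one pass over the index, at the cost of holding all distinct recordid values in memory.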


Andy answered Sep 18 '22