Some of the records in my index are duplicated; duplicates are identified by a numeric field recordid.

Elasticsearch has a delete-by-query API. Can I use it to delete one of each set of duplicate records? Or is there some other way to achieve this?
1) If you don't mind generating new _id values and reindexing all of the documents into a new index, then you can use Logstash and the fingerprint filter to generate a unique fingerprint (hash) from the fields that you are trying to de-duplicate, and use this fingerprint as the _id for documents as they are written to the new index. Duplicates then receive the same _id, so only one copy survives the reindex.
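A minimal pipeline sketch along those lines (the index names your_index and your_index_deduped are placeholders, the hash key is arbitrary, and the duplicate key is assumed to be recordid):

input {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "your_index"
  }
}
filter {
  fingerprint {
    # hash the field(s) that define a duplicate
    source => ["recordid"]
    target => "[@metadata][fingerprint]"
    method => "SHA256"
    key => "any-static-key"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "your_index_deduped"
    # duplicates share a fingerprint, so later copies overwrite earlier ones
    document_id => "%{[@metadata][fingerprint]}"
  }
}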
Yes, you can find duplicated documents with an aggregation query:
curl -XPOST http://localhost:9200/your_index/_search -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "recordid",
        "min_doc_count": 2,
        "size": 10
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 10
          }
        }
      }
    }
  }
}'
Then delete the duplicated documents, preferably with a bulk request. Have a look at es-deduplicator for automated duplicate removal (disclaimer: I'm the author of that script).
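For example, once you have collected the _id values of the redundant copies from the top_hits above (the ids below are made-up placeholders), they can be removed in a single bulk request:

curl -XPOST http://localhost:9200/_bulk -H 'Content-Type: application/x-ndjson' -d '
{ "delete": { "_index": "your_index", "_id": "id-of-one-duplicate" } }
{ "delete": { "_index": "your_index", "_id": "id-of-another-duplicate" } }
'

Each action goes on its own line, and delete actions carry no payload line. Keep one document per recordid and delete the rest.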
NOTE: Aggregation queries can be very expensive and might even crash your nodes (if your index is too large and the number of data nodes too small).
Elasticsearch recommends using the scroll/scan API to find all matching ids and then issuing a bulk request to delete them.
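A minimal sketch of that flow (the index name, page size, and query are assumptions): open a scroll that returns only document ids, then keep pulling pages with the returned _scroll_id until a page comes back empty.

curl -XPOST 'http://localhost:9200/your_index/_search?scroll=1m' -H 'Content-Type: application/json' -d '
{
  "size": 1000,
  "_source": false,
  "query": { "match_all": {} }
}'

# each response carries a _scroll_id; pass it back to fetch the next page
curl -XPOST 'http://localhost:9200/_search/scroll' -H 'Content-Type: application/json' -d '
{
  "scroll": "1m",
  "scroll_id": "<_scroll_id from the previous response>"
}'

From each page, keep one _id per recordid and feed the rest into the bulk delete shown above.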