
Elasticsearch: Remove duplicates from index

I have an index with multiple duplicate entries. They have different ids but the other fields have identical content.

For example:

{id: 1, content: 'content1'}
{id: 2, content: 'content1'}
{id: 3, content: 'content2'}
{id: 4, content: 'content2'}

After removing the duplicates:

{id: 1, content: 'content1'}
{id: 3, content: 'content2'}

Is there a way to delete all duplicates and keep only one distinct entry without manually comparing all entries?

fwind asked Jun 01 '15 13:06



2 Answers

This can be accomplished in several ways. Below I outline two possible approaches:

1) If you don't mind generating new _id values and reindexing all of the documents into a new index, then you can use Logstash and the fingerprint filter to generate a unique fingerprint (hash) from the fields that you are trying to de-duplicate, and use this fingerprint as the _id for documents as they are written into the new index. Since the _id field must be unique, any documents that have the same fingerprint will be written to the same _id and therefore deduplicated.

2) You can write a custom script that scrolls over your index. As each document is read, you can create a hash from the fields that you consider to define a unique document (in your case, the content field). Then use this hash as the key in a dictionary (aka hash table). The value associated with this key would be a list of the _ids of all documents that generate this same hash. Once you have all of the hashes and associated lists of _ids, you can execute a delete operation on all but one of the _ids that are associated with each identical hash. Note that this second approach does not require writing documents to a new index in order to de-duplicate, as you would delete documents directly from the original index.
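The second approach could be sketched roughly as follows. This is only an illustration, not the code from the blog post: the function names are hypothetical, the hits are hard-coded stand-ins for what a scroll over the index would return, and the index name `my-index` is made up. Nothing here talks to a cluster; it only works out which _ids would be deleted.

```python
import hashlib
import json
from collections import defaultdict


def fingerprint(source, fields):
    """Hash only the fields that define a duplicate (here: 'content')."""
    # sort_keys gives a stable serialization for identical field values
    payload = json.dumps({f: source.get(f) for f in fields}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def group_ids_by_fingerprint(hits, fields):
    """Map each fingerprint to the list of _ids that produced it."""
    groups = defaultdict(list)
    for hit in hits:
        groups[fingerprint(hit["_source"], fields)].append(hit["_id"])
    return groups


def ids_to_delete(groups):
    """Keep the first _id seen in each group and return all the others."""
    doomed = []
    for ids in groups.values():
        doomed.extend(ids[1:])
    return doomed


def delete_actions(index, ids):
    """Build one bulk-style delete action per _id (hypothetical helper)."""
    return [{"_op_type": "delete", "_index": index, "_id": i} for i in ids]


# Stand-in for the hits read while scrolling over the index (assumed shape)
hits = [
    {"_id": "1", "_source": {"content": "content1"}},
    {"_id": "2", "_source": {"content": "content1"}},
    {"_id": "3", "_source": {"content": "content2"}},
    {"_id": "4", "_source": {"content": "content2"}},
]

groups = group_ids_by_fingerprint(hits, ["content"])
duplicates = ids_to_delete(groups)
print(duplicates)  # ['2', '4']
actions = delete_actions("my-index", duplicates)
```

With the elasticsearch-py client, the hits would typically come from `elasticsearch.helpers.scan` and the actions could be passed to `elasticsearch.helpers.bulk`; older clusters that still use mapping types may also need a `_type` in each action. The same fingerprint, used as the _id when reindexing, gives the effect described in the first approach.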

I have written a blog post and code that demonstrates both of these approaches at the following URL: https://alexmarquardt.com/2018/07/23/deduplicating-documents-in-elasticsearch/

Disclaimer: I am a Consulting Engineer at Elastic.

Alexander Marquardt answered Nov 20 '22 04:11



I use Rails and, when necessary, I re-import with FORCE=y, which removes and re-indexes everything for that index and type. I am not sure what environment you are running ES in, though. The only issue I can see is if the data source you are importing from (i.e. the database) itself contains duplicate records. I would first check whether the data source can be fixed, if that is feasible, and then re-index everything; otherwise you could write a custom import method that indexes only one item from each set of duplicates.

Alternatively, although I know this does not meet your goal of removing the duplicate entries, you could customize your search so that it returns only one of the duplicate ids, either by picking the most recent "timestamp" or by indexing deduplicated data and grouping by your content field; see if this post helps. The duplicate records would still remain in your index, but at least they would not come up in the search results.
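One way to get a single hit per distinct content value at search time is field collapsing, which is available in Elasticsearch 5.3 and later. The sketch below only builds the request body and does not contact a cluster. It assumes a keyword sub-field called `content.keyword` and a date field called `timestamp`; neither is confirmed by the question, and the function name is hypothetical.

```python
import json


def collapsed_search_body(query, collapse_field="content.keyword",
                          sort_field="timestamp"):
    """Build a search body that returns one hit per distinct collapse_field value."""
    return {
        "query": query,
        # Collapsing keeps only the top-sorted document for each field value
        "collapse": {"field": collapse_field},
        # Newest first, so the kept document is the most recent duplicate
        "sort": [{sort_field: {"order": "desc"}}],
    }


body = collapsed_search_body({"match_all": {}})
print(json.dumps(body, indent=2))
```

The collapse field has to be a keyword or numeric field with doc values, so an analyzed text field would not work directly.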

I also found this: Elasticsearch delete duplicates

I tried to think of several possible scenarios; see if any of those options works for you, or could at least serve as a temporary fix.

jflay answered Nov 20 '22 06:11
