We've discovered some duplicate documents in one of our Elasticsearch indices and we haven't been able to work out the cause. There are two copies of each of the affected documents, and they have exactly the same _id
, _type
and _uid
fields.
A GET request to /index-name/document-type/document-id
just returns one copy, but searching for the document with a query like this returns two results, which is quite surprising:
POST /index-name/document-type/_search
{
"filter": {
"term": {
"_id": "document-id"
}
}
}
Aggregating on the _uid
field also identifies the duplicate documents:
POST /index-name/_search
{
"size": 0,
"aggs": {
"duplicates": {
"terms": {
"field": "_uid",
"min_doc_count": 2
}
}
}
}
The duplicates are all on different shards. For example, a document might have one copy on primary shard 0 and one copy on primary shard 1. We've verified this by running the aggregate query above on each shard in turn using the preference parameter: it does not find any duplicates within a single shard.
Our best guess is that something has gone wrong with the routing, but we don't understand how the copies could have been routed to different shards. According to the routing documentation, the default routing is based on the document ID, and should consistently route a document to the same shard.
We are not using custom routing parameters that would override the default routing. We've double-checked this by making sure that the duplicate documents don't have a _routing
field.
We also don't define any parent/child relationships which would also affect routing. (See this question in the Elasticsearch forum, for example, which has the same symptoms as our problem. We don't think the cause is the same because we're not setting any document parents).
We fixed the immediate problem by reindexing into a new index, which squashed the duplicate documents. We still have the old index around for debugging.
We haven't found a way of replicating the problem. The new index is indexing documents correctly, and we've tried rerunning an overnight processing job which also updates documents but it hasn't created any more duplicates.
The cluster has 3 nodes, 3 primary shards and 1 replica (i.e. 3 replica shards). minimum_master_nodes
is set to 2, which should prevent the split-brain issue. We're running Elasticsearch 2.4 (which we know is old - we're planning to upgrade soon).
Does anyone know what might cause these duplicates? Do you have any suggestions for ways to debug it?
We found the answer! The problem was that the index had unexpectedly switched the hashing algorithm it used for routing, and this caused some updated documents to be stored on different shards to their original versions.
A GET request to /index-name/_settings
revealed this:
"version": {
"created": "1070599",
"upgraded": "2040699"
},
"legacy": {
"routing": {
"use_type": "false",
"hash": {
"type": "org.elasticsearch.cluster.routing.DjbHashFunction"
}
}
}
"1070599" refers to Elasticsearch 1.7, and "2040699" is ES 2.4.
It looks like the index tried to upgrade itself from 1.7 to 2.4, despite the fact that it was already running 2.4. This is the issue described here: https://github.com/elastic/elasticsearch/issues/18459#issuecomment-220313383
We think this is what happened to trigger the change:
Back when we upgraded the index from ES 1.7 to 2.4, we decided not to upgrade Elasticsearch in-place, since that would cause downtime. Instead, we created a separate ES 2.4 cluster.
We loaded data into the new cluster using a tool that copied over all the index settings as well as the data, including the version
setting which you should not set in ES 2.4.
While dealing with a recent issue, we happened to close and reopen the index. This normally preserves all the data, but because of the incorrect version
setting, it caused Elasticsearch to think that an upgrade was in processed.
ES automatically set the legacy.routing.hash.type
setting because of the false upgrade. This meant that any data indexed after this point used the old DjbHashFunction
instead of the default Murmur3HashFunction
which had been used to route the data originally.
This means that reindexing the data into a new index was the right thing to do to fix the issue. The new index has the correct version setting and no legacy hash function settings:
"version": {
"created": "2040699"
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With