Duplicate documents in Elasticsearch index with the same _uid

We've discovered some duplicate documents in one of our Elasticsearch indices and we haven't been able to work out the cause. There are two copies of each of the affected documents, and they have exactly the same _id, _type and _uid fields.

A GET request to /index-name/document-type/document-id just returns one copy, but searching for the document with a query like this returns two results, which is quite surprising:

POST /index-name/document-type/_search
{
  "filter": {
    "term": {
      "_id": "document-id"
    }
  }
}

Aggregating on the _uid field also identifies the duplicate documents:

POST /index-name/_search
{
  "size": 0,
  "aggs": {
    "duplicates": {
      "terms": {
        "field": "_uid",
        "min_doc_count": 2
      }
    }
  }
}

The duplicates are all on different shards. For example, a document might have one copy on primary shard 0 and one copy on primary shard 1. We've verified this by running the aggregate query above on each shard in turn using the preference parameter: it does not find any duplicates within a single shard.

Our best guess is that something has gone wrong with the routing, but we don't understand how the copies could have been routed to different shards. According to the routing documentation, the default routing is based on the document ID, and should consistently route a document to the same shard.
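That default routing formula is essentially `shard = hash(_routing) % number_of_primary_shards`, where `_routing` defaults to the document `_id`. A minimal sketch of why it should be consistent (using Python's `zlib.crc32` as a stand-in for Elasticsearch's actual Murmur3 hash, so the shard numbers are illustrative only):

```python
import zlib

NUM_PRIMARY_SHARDS = 3  # matches our index settings

def route(doc_id, num_shards=NUM_PRIMARY_SHARDS):
    # Stand-in for ES default routing: hash(_routing) % number_of_primary_shards.
    # ES 2.x actually uses Murmur3; crc32 here just illustrates the determinism.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

# The same ID always hashes to the same shard, so (absent custom routing)
# a document should never end up on two different primary shards.
assert route("document-id") == route("document-id")
```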

We are not using custom routing parameters that would override the default routing. We've double-checked this by making sure that the duplicate documents don't have a _routing field.

We also don't define any parent/child relationships, which would also affect routing. (See this question in the Elasticsearch forum, for example, which has the same symptoms as our problem. We don't think the cause is the same, because we're not setting any document parents.)

We fixed the immediate problem by reindexing into a new index, which squashed the duplicate documents. We still have the old index around for debugging.

We haven't found a way of replicating the problem. The new index is indexing documents correctly, and we've tried rerunning an overnight processing job, which also updates documents, but it hasn't created any more duplicates.

The cluster has 3 nodes, 3 primary shards and 1 replica (i.e. 3 replica shards). minimum_master_nodes is set to 2, which should prevent the split-brain issue. We're running Elasticsearch 2.4 (which we know is old - we're planning to upgrade soon).
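(For context, `minimum_master_nodes` should be a strict majority of the master-eligible nodes; a quick sketch of that calculation, assuming all 3 nodes are master-eligible:)

```python
def minimum_master_nodes(master_eligible_nodes):
    # Quorum to avoid split-brain: floor(N / 2) + 1.
    return master_eligible_nodes // 2 + 1

print(minimum_master_nodes(3))  # → 2, matching our setting
```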

Does anyone know what might cause these duplicates? Do you have any suggestions for ways to debug it?

Asked by Suzanne on Nov 02 '17


1 Answer

We found the answer! The problem was that the index had unexpectedly switched the hashing algorithm it used for routing, and this caused some updated documents to be stored on different shards to their original versions.

A GET request to /index-name/_settings revealed this:

"version": {
  "created": "1070599",
  "upgraded": "2040699"
},
"legacy": {
  "routing": {
    "use_type": "false",
    "hash": {
      "type": "org.elasticsearch.cluster.routing.DjbHashFunction"
    }
  }
}

"1070599" refers to Elasticsearch 1.7, and "2040699" is ES 2.4.

It looks like the index tried to upgrade itself from 1.7 to 2.4, despite the fact that it was already running 2.4. This is the issue described here: https://github.com/elastic/elasticsearch/issues/18459#issuecomment-220313383

We think this is what happened to trigger the change:

  1. Back when we upgraded the index from ES 1.7 to 2.4, we decided not to upgrade Elasticsearch in-place, since that would cause downtime. Instead, we created a separate ES 2.4 cluster.

    We loaded data into the new cluster using a tool that copied over all the index settings as well as the data, including the version setting, which should not be set manually in ES 2.4.

  2. While dealing with a recent issue, we happened to close and reopen the index. This normally preserves all the data, but because of the incorrect version setting, it caused Elasticsearch to think that an upgrade from 1.7 was required.

  3. ES automatically set the legacy.routing.hash.type setting because of the false upgrade. This meant that any data indexed after this point used the old DjbHashFunction instead of the default Murmur3HashFunction which had been used to route the data originally.
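The effect of step 3 can be sketched in Python. This is a hedged illustration, not the real ES code: the DJB-style hash below is the classic DJB2 string hash (standing in for `DjbHashFunction`), and `zlib.crc32` stands in for `Murmur3HashFunction`. The details differ from Elasticsearch's implementations, but the mechanism is the same: a document indexed under one hash function and updated under the other can land on two different shards, producing a duplicate.

```python
import zlib

NUM_SHARDS = 3  # our index has 3 primary shards

def djb_shard(doc_id, num_shards=NUM_SHARDS):
    # Classic DJB2 string hash, standing in for DjbHashFunction.
    h = 5381
    for ch in doc_id:
        h = ((h * 33) + ord(ch)) & 0xFFFFFFFF
    return h % num_shards

def murmur_shard(doc_id, num_shards=NUM_SHARDS):
    # crc32 standing in for Murmur3HashFunction (illustrative only).
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

# Any ID where the two functions disagree gets a second copy on update:
# the original copy stays on murmur_shard(id), while the update is
# written to djb_shard(id) as a brand-new document.
duplicated = [f"doc-{i}" for i in range(100)
              if djb_shard(f"doc-{i}") != murmur_shard(f"doc-{i}")]
print(f"{len(duplicated)} of 100 sample IDs would be duplicated across shards")
```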

This means that reindexing the data into a new index was the right thing to do to fix the issue. The new index has the correct version setting and no legacy hash function settings:

"version": {
  "created": "2040699"
}

Answered by Suzanne on Nov 17 '22