Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Elasticsearch index sharding explanation

I am trying to figure out the concept of elastic search index and quite don't understand it. I want to make a couple of points in advance. I understand how inverse document index works (mapping terms to document ids), I also understand how the document ranking works based on TF-IDF. What I don't understand is the data structure of the actual index. When referring to the elastic search documentation it describes the index as a "table with mappings to the documents". So, here comes sharding!! When you look at typical picture of the elastic search index, it is represented like this: Elastic search index What the picture doesn't show is how the actual partitioning happens and how this [table -> document] link is split across multiple shards. For instance, does each shard split the table vertically? Meaning the inverted index table only contains terms that are present on the shard. For instance, lets assume we have 3 shards, meaning the first one will contain document1, the second shard only contains document 2 and the 3rd shard is document 3. Now, would the first shard index only contain terms that are present in document1? In this case [Blue, bright, butterfly, breeze, hangs]. If so, what if someone searches for [forget], how does elastic search "knows" not to search in shard 1, or it searches all shards every time? When you look at the cluster image: enter image description here

It is not clear what exactly is in shard1, shard2 and shard3. We go from Term -> DocumentId -> Document to a "rectangular" shard, but what does the shard contain exactly?

I would appreciate if someone can explain it from the picture above.

like image 952
Alex Pryiomka Avatar asked Oct 29 '17 17:10

Alex Pryiomka


1 Answers

Theory

Elastichsarch built on top of Lucene. Every shard is simply a Lucene index. Lucene index, if simplified, is the inverted index. Every Elasticsearch index is a bunch of shards or Lucene indices. When you query for a document, Elasticsearch will subquery all shards, merge results and return it to you. When you index document to Elasticsearch, the Elasticsearch will calculate in which shard document should be written using the formula

shard = hash(routing) % number_of_primary_shards

By default as a routing, the Elasticsearch uses the document id. If you specify routing param, it will be used instead of id. You can use routing param both in search queries and in requests for indexing, deleting or updating a document. By default as a hash function used MurmurHash3

Elasticsearch index

Example

Create index with 3 shards

$ curl -XPUT localhost:9200/so -d '
{ 
    "settings" : { 
        "index" : { 
            "number_of_shards" : 3, 
            "number_of_replicas" : 0 
        } 
    } 
}'

Index document

$ curl -XPUT localhost:9200/so/question/1 -d '
{ 
    "number" : 47011047, 
    "title" : "need elasticsearch index sharding explanation" 
}'

Query without routing

$ curl "localhost:9200/so/question/_search?&pretty"

Response

Look at _shards.total - this is a number of shards that were queried. Also note that we found the document

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "so",
        "_type" : "question",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "number" : 47011047,
          "title" : "need elasticsearch index sharding explanation"
        }
      }
    ]
  }
}

Query with correct routing

$ curl "localhost:9200/so/question/_search?explain=true&routing=1&pretty"

Response

_shards.total now 1, because we specified routing and elasticsearch know which shard to ask for document. With param explain=true I ask elasticsearch to give me additional information about query. Pay attention to hits._shard - it's setted to [so][2]. It means that our document stored in second shard of so index.

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_shard" : "[so][2]",
        "_node" : "2skA6yiPSVOInMX0ZsD91Q",
        "_index" : "so",
        "_type" : "question",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "number" : 47011047,
          "title" : "need elasticsearch index sharding explanation"
        },
        ...
}

Query with uncorrect routing

$ curl "localhost:9200/so/question/_search?explain=true&routing=2&pretty"

Response

_shards.total again 1. But Elasticsearch return nothing to our query because we specify wrong routing and Elasticsearch query the shard, where there is no document.

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

Additional information

  • An excellent explanation of the Lucene internals from Adrien Grande
  • A Dive into the Elasticsearch Storage by Njal Karevoll
  • Routing a Document to a Shard
like image 50
Nikita Klimov Avatar answered Nov 18 '22 20:11

Nikita Klimov