I am trying to figure out the concept of elastic search index and quite don't understand it. I want to make a couple of points in advance. I understand how inverse document index works (mapping terms to document ids), I also understand how the document ranking works based on TF-IDF. What I don't understand is the data structure of the actual index. When referring to the elastic search documentation it describes the index as a "table with mappings to the documents". So, here comes sharding!! When you look at typical picture of the elastic search index, it is represented like this: What the picture doesn't show is how the actual partitioning happens and how this [table -> document] link is split across multiple shards. For instance, does each shard split the table vertically? Meaning the inverted index table only contains terms that are present on the shard. For instance, lets assume we have 3 shards, meaning the first one will contain document1, the second shard only contains document 2 and the 3rd shard is document 3. Now, would the first shard index only contain terms that are present in document1? In this case [Blue, bright, butterfly, breeze, hangs]. If so, what if someone searches for [forget], how does elastic search "knows" not to search in shard 1, or it searches all shards every time? When you look at the cluster image:
It is not clear what exactly is in shard1, shard2 and shard3. We go from Term -> DocumentId -> Document to a "rectangular" shard, but what does the shard contain exactly?
I would appreciate if someone can explain it from the picture above.
Elastichsarch built on top of Lucene. Every shard is simply a Lucene index. Lucene index, if simplified, is the inverted index. Every Elasticsearch index is a bunch of shards or Lucene indices. When you query for a document, Elasticsearch will subquery all shards, merge results and return it to you. When you index document to Elasticsearch, the Elasticsearch will calculate in which shard document should be written using the formula
shard = hash(routing) % number_of_primary_shards
By default as a routing, the Elasticsearch uses the document id
. If you specify routing
param, it will be used instead of id
. You can use routing
param both in search queries and in requests for indexing, deleting or updating a document.
By default as a hash function used MurmurHash3
$ curl -XPUT localhost:9200/so -d '
{
"settings" : {
"index" : {
"number_of_shards" : 3,
"number_of_replicas" : 0
}
}
}'
$ curl -XPUT localhost:9200/so/question/1 -d '
{
"number" : 47011047,
"title" : "need elasticsearch index sharding explanation"
}'
$ curl "localhost:9200/so/question/_search?&pretty"
Look at _shards.total
- this is a number of shards that were queried. Also note that we found the document
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 3,
"successful" : 3,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index" : "so",
"_type" : "question",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"number" : 47011047,
"title" : "need elasticsearch index sharding explanation"
}
}
]
}
}
$ curl "localhost:9200/so/question/_search?explain=true&routing=1&pretty"
_shards.total
now 1, because we specified routing and elasticsearch know which shard to ask for document. With param explain=true
I ask elasticsearch to give me additional information about query. Pay attention to hits._shard
- it's setted to [so][2]
. It means that our document stored in second shard of so
index.
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_shard" : "[so][2]",
"_node" : "2skA6yiPSVOInMX0ZsD91Q",
"_index" : "so",
"_type" : "question",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"number" : 47011047,
"title" : "need elasticsearch index sharding explanation"
},
...
}
$ curl "localhost:9200/so/question/_search?explain=true&routing=2&pretty"
_shards.total again 1. But Elasticsearch return nothing to our query because we specify wrong routing and Elasticsearch query the shard, where there is no document.
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With