
Relation between shard keys and chunks in MongoDB sharded cluster?

I can't really understand the shard key concept in a MongoDB sharded cluster, as I've just started learning MongoDB.

Citing the MongoDB documentation:

A chunk is a contiguous range of shard key values assigned to a particular shard. When they grow beyond the configured chunk size, a mongos splits the chunk into two chunks.

It seems that chunk size is something related to a particular shard, not to the cluster itself. Am I right?

Speaking about the cardinality of a shard key:

Consider the use of a state field as a shard key:

The state key’s value holds the US state for a given address document. This field has a low cardinality as all documents that have the same value in the state field must reside on the same shard, even if a particular state’s chunk exceeds the maximum chunk size.

Since there are a limited number of possible values for the state field, MongoDB may distribute data unevenly among a small number of fixed chunks.

My question is how the shard key relates to the chunk size.

It seems to me that, with just two shard servers, it wouldn't be possible to distribute the data, because documents with the same value in the state field must reside on the same shard. With three documents with states like Arizona, Indiana and Maine, how is the data distributed among just two shards?

gremo asked May 04 '13 20:05





2 Answers

In order to understand the answer to your question, you need to understand range-based partitioning. If you have N documents, they will be partitioned into chunks, and the split points between chunks are determined by your shard key.

With the shard key being some field in your document, all the possible values of the shard key are considered, and all the documents are (logically) split into chunks/ranges based on each document's shard key value.
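To make the range idea concrete for the question's three states and two shards, here is a minimal sketch. The split points, the chunk-to-shard mapping and the function names are all made up for illustration; a real cluster chooses its own split points and keeps this metadata on the config servers.

```python
from bisect import bisect_right

# Hypothetical split points on the "state" shard key. They cut the key
# space into three ranges: [min, "Florida"), ["Florida", "Kansas"),
# and ["Kansas", max).
SPLIT_POINTS = ["Florida", "Kansas"]

# Hypothetical assignment of chunks to the two shards.
CHUNK_TO_SHARD = {0: "shard0", 1: "shard1", 2: "shard0"}


def chunk_for(key: str) -> int:
    """Return the index of the chunk whose range contains the key."""
    return bisect_right(SPLIT_POINTS, key)


def shard_for(key: str) -> str:
    """Return the shard that currently owns the key's chunk."""
    return CHUNK_TO_SHARD[chunk_for(key)]


for state in ["Arizona", "Indiana", "Maine"]:
    print(state, "-> chunk", chunk_for(state), "on", shard_for(state))
```

With these assumed ranges, every document for a given state lands in exactly one chunk, and a shard can own several chunks, so three states fit on two shards without trouble.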

In your example there are 50 possible values for "state" (okay, probably more like 52), so there can be at most 52 chunks. The default chunk size is 64MB. Now imagine that you are sharding a collection with ten million documents which are 1K each. Each chunk should not contain more than about 65K documents (64MB divided by 1K). Ten million documents should be split into more than 150 chunks, but we only have 52 distinct values for the shard key! So your chunks are going to be very large. Why is that a problem? Well, in order to auto-balance chunks among shards, the system needs to migrate chunks between shards, and if a chunk is too big, it can't be moved. And since it can't be split either, you'll be stuck with an unbalanced cluster.
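The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the variable names are mine, and the average chunk size at the end assumes documents are spread evenly across the 52 key values, which real address data would not be.

```python
import math

CHUNK_SIZE_BYTES = 64 * 1024 * 1024  # 64MB default chunk size from the answer
DOC_SIZE_BYTES = 1024                # 1K per document
TOTAL_DOCS = 10_000_000
DISTINCT_KEY_VALUES = 52             # upper bound on the number of chunks

# How many documents fit in one chunk of the configured size.
docs_per_chunk = CHUNK_SIZE_BYTES // DOC_SIZE_BYTES

# How many chunks the data would need if it could be split freely.
chunks_wanted = math.ceil(TOTAL_DOCS / docs_per_chunk)

# Average chunk size in MB if only 52 chunks can exist
# (assumption: documents are evenly spread across the key values).
avg_chunk_mb = TOTAL_DOCS * DOC_SIZE_BYTES / DISTINCT_KEY_VALUES / (1024 * 1024)

print(docs_per_chunk, chunks_wanted, round(avg_chunk_mb))
```

Under that even-spread assumption, each of the 52 chunks averages close to three times the configured chunk size.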

Asya Kamsky answered Oct 03 '22 14:10



There is definitely a relationship between shard key and chunk size. You want to choose a shard key with a high level of cardinality. That is, you want a shard key that can have many possible values, as opposed to something like State, which is basically locked into only 50 possible values. Low cardinality shard keys like that can result in chunks that consist of only one shard key value and thus cannot be split, which in turn can leave them too large to be moved to another shard in a balancing operation.
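As a rough illustration of why a single-value chunk cannot be split, here is a toy split-point finder. It is a simplified sketch under my own assumptions (the function name and the median-based choice are hypothetical), not MongoDB's actual splitting algorithm.

```python
from typing import Optional, Sequence


def find_split_point(keys: Sequence[str]) -> Optional[str]:
    """Pick a key value at which a chunk could be split into two ranges.

    Returns None when no split is possible, i.e. when the chunk is empty
    or every document in it has the same shard key value.
    """
    if not keys:
        return None
    ordered = sorted(keys)
    median = ordered[len(ordered) // 2]
    if median != ordered[0]:
        # The lower range ends just before the median; the upper starts at it.
        return median
    # The median equals the smallest key, so use the next distinct value.
    for key in ordered:
        if key > median:
            return key
    return None  # only one distinct value: the chunk cannot be split


print(find_split_point(["Maine"] * 5))                      # low cardinality
print(find_split_point(["555-0101", "555-0102", "555-0103", "555-0104"]))
```

A chunk holding only "Maine" documents has no value to split on, however large it grows, while a chunk keyed on something like phone numbers almost always does.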

High cardinality of the shard key (like a person's phone number as opposed to their State or Zip Code) is essential to ensure even distribution of data. Low cardinality shard keys can lead to larger chunks that cannot be split, because many documents share the same key value and must be kept together.

cmbaxter answered Oct 03 '22 14:10
