Relation between shard keys and chunks in a MongoDB sharded cluster?

Tags:

sharding

I can't really understand the shard key concept in a MongoDB sharded cluster, as I've just started learning MongoDB.

Citing the MongoDB documentation:

A chunk is a contiguous range of shard key values assigned to a particular shard. When they grow beyond the configured chunk size, a mongos splits the chunk into two chunks.

It seems that chunk size is something related to a particular shard, not to the cluster itself. Am I right?

Speaking about the cardinality of a shard key:

Consider the use of a state field as a shard key:

The state key’s value holds the US state for a given address document. This field has a low cardinality as all documents that have the same value in the state field must reside on the same shard, even if a particular state’s chunk exceeds the maximum chunk size.

Since there are a limited number of possible values for the state field, MongoDB may distribute data unevenly among a small number of fixed chunks.

My question is how the shard key relates to the chunk size.

It seems to me that, with just two shard servers, it wouldn't be possible to distribute the data, because documents with the same value in the state field must reside on the same shard. Given three documents with states like Arizona, Indiana and Maine, how is the data distributed among just two shards?


asked May 04 '13 20:05

gremo

2 Answers

To understand the answer to your question, you need to understand range-based partitioning. If you have N documents, they will be partitioned into chunks, and the split points between chunks are determined by your shard key.

The shard key is some field in your document. All the possible values of the shard key are considered, and the documents are (logically) split into chunks/ranges according to each document's shard key value.
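As a rough illustration of range-based partitioning applied to the question's example, here is a minimal Python sketch. The split point, the chunk-to-shard mapping, and the function names are all hypothetical; a real cluster chooses split points and assigns chunks itself. Each chunk is treated as a range with an inclusive lower bound and an exclusive upper bound.

```python
from bisect import bisect_right

# Hypothetical split point on the "state" shard key. One split point
# yields two chunks: (-inf, "Indiana") and ["Indiana", +inf).
SPLIT_POINTS = ["Indiana"]

# Hypothetical assignment of chunk index -> shard name.
CHUNK_TO_SHARD = ["shard0", "shard1"]


def chunk_index(shard_key_value, split_points=SPLIT_POINTS):
    """Return the index of the chunk whose range contains the value.

    bisect_right puts a value equal to a split point into the chunk that
    starts at that split point (inclusive lower bound).
    """
    return bisect_right(split_points, shard_key_value)


def shard_for(shard_key_value):
    """Return the (hypothetical) shard that owns the value's chunk."""
    return CHUNK_TO_SHARD[chunk_index(shard_key_value)]


for state in ["Arizona", "Indiana", "Maine"]:
    print(state, "->", shard_for(state))
```

With this particular split point, Arizona lands in the first chunk and Indiana and Maine share the second, so three distinct states fit on two shards. What cannot happen is a single state's documents being divided between two chunks.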

In your example there are 50 possible values for "state" (okay, probably more like 52), so there can be at most 52 chunks. The default chunk size is 64MB. Now imagine that you are sharding a collection with ten million documents of 1K each. Each chunk should contain no more than about 65K documents, so ten million documents should be split into more than 150 chunks, but we only have 52 distinct values for the shard key! So your chunks are going to be very large. Why is that a problem? Well, in order to auto-balance chunks among shards, the system needs to migrate chunks between shards, and if a chunk is too big, it can't be moved. And since it can't be split, you'll be stuck with an unbalanced cluster.
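The arithmetic above can be checked with a short Python sketch. The constant names are mine, and it assumes 1K means 1024 bytes and 64MB means 64 * 1024 * 1024 bytes.

```python
import math

CHUNK_SIZE_BYTES = 64 * 1024 * 1024   # default chunk size cited above (64MB)
DOC_SIZE_BYTES = 1024                 # assumed size of each document (1K)
NUM_DOCS = 10_000_000                 # ten million documents
DISTINCT_KEY_VALUES = 52              # upper bound on distinct "state" values

# How many documents fit in one chunk of the configured size.
docs_per_chunk = CHUNK_SIZE_BYTES // DOC_SIZE_BYTES

# How many chunks the data would need if it could be split freely.
chunks_needed = math.ceil(NUM_DOCS / docs_per_chunk)

# With one chunk per distinct key value at best, the average chunk size
# if documents were spread evenly over all 52 values.
avg_chunk_bytes = NUM_DOCS * DOC_SIZE_BYTES / DISTINCT_KEY_VALUES

print("documents per chunk:", docs_per_chunk)
print("chunks needed:", chunks_needed)
print("maximum possible chunks:", DISTINCT_KEY_VALUES)
print("average chunk size in MB:", round(avg_chunk_bytes / (1024 * 1024), 1))
```

Even in the best case of a perfectly even spread across states, the average chunk comes out at roughly three times the configured chunk size. Real data is skewed toward populous states, so some chunks would be far larger.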


answered Oct 03 '22 14:10

Asya Kamsky

There is definitely a relationship between shard key and chunk size. You want to choose a shard key with a high level of cardinality. That is, you want a shard key that can have many possible values, as opposed to something like State, which is basically locked into only 50 possible values. Low cardinality shard keys like that can result in chunks that consist of only one shard key value and thus cannot be split and moved to another shard in a balancing operation.

High cardinality of the shard key (like a person's phone number as opposed to their State or Zip Code) is essential to ensure even distribution of data. Low cardinality shard keys can lead to larger chunks that cannot be split, because many documents share the same key value and must be kept together.
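To make the difference concrete, here is a toy Python model. It is not MongoDB's actual split algorithm, and the function name and sample data are made up for illustration. It splits sorted shard key values greedily into chunks of a maximum document count, with the one rule that documents sharing a key value are never separated.

```python
from itertools import groupby


def split_into_chunks(keys, max_docs_per_chunk):
    """Greedily group sorted key values into chunks.

    Documents with the same key value always stay in the same chunk,
    so a chunk may exceed max_docs_per_chunk when one value is common.
    """
    chunks, current = [], []
    for _, group in groupby(sorted(keys)):
        group = list(group)
        # Start a new chunk if adding this key's documents would overflow.
        if current and len(current) + len(group) > max_docs_per_chunk:
            chunks.append(current)
            current = []
        current.extend(group)
    if current:
        chunks.append(current)
    return chunks


# Low cardinality: 12 documents, only 3 distinct key values.
low = ["AZ"] * 6 + ["IN"] * 5 + ["ME"] * 1
# High cardinality: 12 documents, 12 distinct (made-up) phone numbers.
high = [f"555-{i:04d}" for i in range(12)]

low_sizes = [len(c) for c in split_into_chunks(low, 4)]
high_sizes = [len(c) for c in split_into_chunks(high, 4)]

print("low cardinality chunk sizes:", low_sizes)    # [6, 5, 1]
print("high cardinality chunk sizes:", high_sizes)  # [4, 4, 4]
```

With the low cardinality key, two of the three chunks exceed the limit of four documents and cannot be split any further. With the high cardinality key, every chunk stays within the limit.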

answered Oct 03 '22 14:10

cmbaxter
