I have some sharded collections whose sizes range between 50 and 90 MiB, on MongoDB 2.4.11. The default chunk size, per the documentation, is 64MB.
When I check the chunk distribution using the command below,
db.getCollection(collName).getShardDistribution()
it shows that:
Some collections with size below 64MB have been split into several chunks.
data : 58.13MiB docs : 148540 chunks : 2
estimated data per chunk : 29.06MiB
estimated docs per chunk : 74270
Some collections with size x, where 64MB < x < 128MB, have more than 2 chunks.
data : 98.24MiB docs : 277520 chunks : 4
estimated data per chunk : 24.56MiB
estimated docs per chunk : 69380
Is this behaviour expected? How does this happen?
The value of 64MB (which is configurable) is the maximum chunk size, not the target chunk size. As a rule of thumb, chunks will generally be created at a little under half that size; a number of factors are involved, but essentially this is normal and nothing to be concerned about.
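As a quick sanity check, a minimal sketch of how your observed chunk counts line up with that rule of thumb (the ceiling-based arithmetic here is my illustration, not the actual server algorithm):

```javascript
// Rough heuristic (an assumption for illustration, not the real server
// logic): chunks tend to be split at around half the max chunk size, so
// the chunk count is roughly dataMiB / (maxChunkSizeMiB / 2), rounded up.
const maxChunkSizeMiB = 64;

function expectedChunks(dataMiB) {
  return Math.max(1, Math.ceil(dataMiB / (maxChunkSizeMiB / 2)));
}

console.log(expectedChunks(58.13)); // 2 - matches the first collection
console.log(expectedChunks(98.24)); // 4 - matches the second collection
```

Both figures from your `getShardDistribution()` output are consistent with splits happening well below the 64MB maximum.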
To explain a little more: chunks will generally be split long before they reach the maximum size. There are two mechanisms that lead to splits - one applies only to the initial sharding of a collection, and the other runs all the time (as long as writes are happening and it has not been disabled).
Both mechanisms actually use the same internal command, splitVector(), to determine whether a chunk should be split. When splitVector is called, it examines the specified range (in this case the entire collection) and returns one or more split points if there are any (an empty array means the chunk is correctly sized and does not need to be split).
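If you are curious, you can invoke splitVector by hand from the shell against the shard's mongod (the namespace and shard key below are just examples, and since this is an internal command its exact output may vary between versions):

```javascript
// Run in the mongo shell connected to the shard's mongod, not the mongos.
// Asks: "would this range need splitting at a 64MB max chunk size?"
db.adminCommand({
    splitVector: "foo.data",     // example namespace
    keyPattern: { _id: 1 },      // the shard key
    min: { _id: MinKey },
    max: { _id: MaxKey },
    maxChunkSize: 64             // in MB
})
// Returns something like { "splitKeys" : [ ... ], "ok" : 1 } - an empty
// splitKeys array means the chunk does not need to be split.
```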
The subsequent splitting of chunks is done by the mongos. Any mongos you use to write to the collection will track how much data is written to a given chunk, and it will periodically check (based on the amount written to the chunk) whether there are any valid split points, again using splitVector to do so. If valid split points are found, it will attempt the split when it can next obtain the required lock.
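A toy sketch of that mongos-side bookkeeping might look like this (this is not the real mongos code, and the check threshold of maxChunkSize / 5 is an assumed heuristic for illustration):

```javascript
// Toy model of the per-chunk write tracking a mongos does. Illustration
// only: the threshold of maxChunkSize / 5 is an assumption, not a
// documented contract.
const maxChunkSizeBytes = 64 * 1024 * 1024;
const splitCheckThreshold = maxChunkSizeBytes / 5; // ~13.4MiB

class ChunkWriteTracker {
  constructor() {
    this.bytesWritten = 0;
  }
  // Record a write; returns true when enough data has accumulated that
  // the mongos would ask splitVector for split points on this chunk.
  recordWrite(nBytes) {
    this.bytesWritten += nBytes;
    if (this.bytesWritten >= splitCheckThreshold) {
      this.bytesWritten = 0; // counter resets after the check
      return true;
    }
    return false;
  }
}

const tracker = new ChunkWriteTracker();
console.log(tracker.recordWrite(10 * 1024 * 1024)); // false - not enough yet
console.log(tracker.recordWrite(5 * 1024 * 1024));  // true  - check triggered
```

The key point is that the check is driven by write volume, which is why a quiet collection can sit above the "expected" chunk size until writes resume.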
You might be wondering how it picks the split points. That's a little complicated: it can be based on data size or the number of documents, and of course on what you have the max chunk size set to. The best way to check for your particular data set is to do a little testing. For example, here are two collections, foo.data and bar.data, both with identically sized documents. I created bar.data with just 50MiB of data and foo.data with 200MiB. The bar.data collection was not split - splitVector was happy with that chunk remaining as-is - while foo.data was split into 9 initial chunks of similar size to what you are seeing (~24MiB):
{ "_id" : "bar", "partitioned" : true, "primary" : "shard0000" }
bar.data
shard key: { "_id" : 1 }
chunks:
shard0000 1
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : shard0000 Timestamp(1, 0)
{ "_id" : "foo", "partitioned" : true, "primary" : "shard0000" }
foo.data
shard key: { "_id" : 1 }
chunks:
shard0000 9
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : ObjectId("0a831759adacefd1231e6939") } on : shard0000 Timestamp(1, 0)
{ "_id" : ObjectId("0a831759adacefd1231e6939") } -->> { "_id" : ObjectId("150f322badacefd1233c920a") } on : shard0000 Timestamp(1, 1)
{ "_id" : ObjectId("150f322badacefd1233c920a") } -->> { "_id" : ObjectId("1f9bfd35adacefd1235b2786") } on : shard0000 Timestamp(1, 2)
{ "_id" : ObjectId("1f9bfd35adacefd1235b2786") } -->> { "_id" : ObjectId("2a213937adacefd1237829cb") } on : shard0000 Timestamp(1, 3)
{ "_id" : ObjectId("2a213937adacefd1237829cb") } -->> { "_id" : ObjectId("34b25e1cadacefd12396d4b1") } on : shard0000 Timestamp(1, 4)
{ "_id" : ObjectId("34b25e1cadacefd12396d4b1") } -->> { "_id" : ObjectId("3f3643feadacefd123b4a8f2") } on : shard0000 Timestamp(1, 5)
{ "_id" : ObjectId("3f3643feadacefd123b4a8f2") } -->> { "_id" : ObjectId("49c8edafadacefd123d33325") } on : shard0000 Timestamp(1, 6)
{ "_id" : ObjectId("49c8edafadacefd123d33325") } -->> { "_id" : ObjectId("5458e4ddadacefd123f14eb5") } on : shard0000 Timestamp(1, 7)
{ "_id" : ObjectId("5458e4ddadacefd123f14eb5") } -->> { "_id" : { "$maxKey" : 1 } } on : shard0000 Timestamp(1, 8)
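If you want to run a similar experiment yourself, something along these lines in the shell will do it (the database/collection names are just the ones from my example, and the document count and payload size are arbitrary):

```javascript
// Run via a mongos; assumes sharding is already enabled on the "foo" database.
var data = db.getSiblingDB("foo").data;

// Insert ~200MiB of identically sized documents (~1KiB payload each).
var payload = new Array(1024).join("x");
for (var i = 0; i < 200 * 1024; i++) {
    data.insert({ payload: payload });
}

// Shard the already-populated collection to exercise the initial split
// pass (the default _id index satisfies the shard key requirement), then
// inspect the resulting chunk ranges.
sh.shardCollection("foo.data", { _id: 1 });
sh.status();
data.getShardDistribution();
```

Varying the amount of data inserted (or the maxChunkSize setting) and re-running is an easy way to see where the split points land for your own document sizes.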