I have a MongoDB sharded cluster with 2 shards (let's say A and B), each with 17GB of free space. I set _id, which contains an ObjectId, as the shard key.
Below are the commands used to set up the database and collection.
sh.enableSharding("testShard");
sh.shardCollection("testShard.shardedCollection", {_id:1});
Then I sent 4,000,000 insert operations to the mongos server by executing the script below 4 times.
for (var i = 0; i < 1000000; i++) {
    db.shardedCollection.insert({x: i});
}
With _id as the shard key, my understanding was that all 4,000,000 documents would fit in one shard and all inserts would happen on shard A only.
However, the result was not what I expected: ~1.3 million documents were inserted in shard A and the other ~2.7 million in shard B.
Why did this happen? Is something missing in the commands I used to shard the collection? Or is my understanding wrong, and there is something like a default range-based shard key in MongoDB?
It would be very helpful if someone could explain the behavior of the default range-based shard key (without tag-aware sharding).
Below is the sh.status() result:
shard key: { "_id" : 1 }
chunks:
B 5
A 5
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : ObjectId("540c703398c7efdea6037cbc") } on : B Timestamp(6, 0)
{ "_id" : ObjectId("540c703398c7efdea6037cbc") } -->> { "_id" : ObjectId("540c703498c7efdea603bfe3") } on : A Timestamp(6, 1)
{ "_id" : ObjectId("540c703498c7efdea603bfe3") } -->> { "_id" : ObjectId("540c704398c7efdea605d818") } on : A Timestamp(3, 0)
{ "_id" : ObjectId("540c704398c7efdea605d818") } -->> { "_id" : ObjectId("540c705298c7efdea607f04e") } on : A Timestamp(4, 0)
{ "_id" : ObjectId("540c705298c7efdea607f04e") } -->> { "_id" : ObjectId("540c707098c7efdea60c20ba") } on : B Timestamp(5, 1)
{ "_id" : ObjectId("540c707098c7efdea60c20ba") } -->> { "_id" : ObjectId("540c7144319c0dbee096f7d6") } on : B Timestamp(2, 4)
{ "_id" : ObjectId("540c7144319c0dbee096f7d6") } -->> { "_id" : ObjectId("540c7183319c0dbee09f58ad") } on : B Timestamp(2, 6)
{ "_id" : ObjectId("540c7183319c0dbee09f58ad") } -->> { "_id" : ObjectId("540eb15ddace5b39fbc32239") } on : B Timestamp(4, 2)
{ "_id" : ObjectId("540eb15ddace5b39fbc32239") } -->> { "_id" : ObjectId("540eb192dace5b39fbca8a84") } on : A Timestamp(5, 2)
{ "_id" : ObjectId("540eb192dace5b39fbca8a84") } -->> { "_id" : { "$maxKey" : 1 } } on : A Timestamp(5, 3)
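One way to read the chunk ranges above: the first 4 bytes of an ObjectId are a creation timestamp in seconds, so each chunk boundary can be decoded into a date. The sketch below is plain JavaScript (runnable in Node.js, no MongoDB connection needed); the helper name objectIdToDate is made up for this illustration and is not part of the shell API. It only checks whether the boundaries copied from the output are in time order, which is what you would expect from a range-partitioned, ever-increasing key.

```javascript
// Chunk boundaries copied from the sh.status() output above, in listed order.
const boundaries = [
  "540c703398c7efdea6037cbc",
  "540c703498c7efdea603bfe3",
  "540c704398c7efdea605d818",
  "540c705298c7efdea607f04e",
  "540c707098c7efdea60c20ba",
  "540c7144319c0dbee096f7d6",
  "540c7183319c0dbee09f58ad",
  "540eb15ddace5b39fbc32239",
  "540eb192dace5b39fbca8a84"
];

// Hypothetical helper: the first 8 hex characters (4 bytes) of an ObjectId
// are a Unix timestamp in seconds.
function objectIdToDate(hex) {
  return new Date(parseInt(hex.slice(0, 8), 16) * 1000);
}

const dates = boundaries.map(objectIdToDate);
dates.forEach((d, i) => console.log(boundaries[i], "->", d.toISOString()));

// True when every boundary is at least as recent as the one before it.
const isTimeOrdered = dates.every((d, i) => i === 0 || d >= dates[i - 1]);
console.log("boundaries are time-ordered:", isTimeOrdered);
```

If the boundaries are time-ordered, every newly generated ObjectId sorts after all existing ones, so new inserts fall into the last range (the one ending at $maxKey) at the time they are written.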
The shard key is either a single indexed field or multiple fields covered by a compound index that determines the distribution of the collection's documents among the cluster's shards.
The distribution of data affects the efficiency and performance of operations within the sharded cluster. The ideal shard key allows MongoDB to distribute documents evenly throughout the cluster while also facilitating common query patterns. When you choose your shard key, consider: the cardinality of the shard key.
But it looks like sharding by user_id, which has been added to the second collection, is not sufficient: the shard key needs to be unique, so every time we did a lookup, we'd be going across all shards. This is not optimal.
As @LalitAgarwal already pointed out, ObjectIds make a bad shard key by default. However, if you don't really care which shard your data lives on and only want the write operations and the chunks distributed evenly among your shards, this is quite easy to achieve:
db.shardedCollection.ensureIndex({_id:"hashed"});
sh.enableSharding("testShard");
sh.shardCollection("testShard.shardedCollection", {_id:"hashed"});
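To see why hashing helps with an ever-increasing key, here is a minimal sketch in plain JavaScript (Node.js, using the built-in crypto module). It is not MongoDB's actual hashed-index function or chunk routing: the helpers bucketFor and distribute are hypothetical, and the MD5-modulo scheme is an assumption chosen only to illustrate the idea that consecutive keys end up scattered over the shards instead of all landing in one range.

```javascript
const crypto = require("crypto");

// Hypothetical helper: map a key to one of `shardCount` buckets by hashing it.
// This only illustrates the principle; it is NOT how MongoDB routes documents.
function bucketFor(key, shardCount) {
  const digest = crypto.createHash("md5").update(String(key)).digest();
  // Use the first 4 bytes of the digest as an unsigned integer.
  return digest.readUInt32BE(0) % shardCount;
}

// Count how many of the keys 0..keyCount-1 fall into each bucket.
function distribute(keyCount, shardCount) {
  const counts = new Array(shardCount).fill(0);
  for (let i = 0; i < keyCount; i++) {
    counts[bucketFor(i, shardCount)]++;
  }
  return counts;
}

// Monotonically increasing keys, two buckets standing in for shards A and B.
const hashedCounts = distribute(10000, 2);
console.log("documents per bucket:", hashedCounts);
```

With a range-based key the same increasing sequence would always target the highest range; with a hashed key, neighbouring values get unrelated hash values, so writes are spread across shards from the start.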
However, this comes with some (often negligible) drawbacks:
A way better approach is to find a non-artificial shard key. Please read Considerations for Selecting Shard Keys for details. In short: