I have a mongodb shard with 2 shards (let say A & B), 17GB free space each. I set _id which contains object ID as shard key. Below are commands used to set db and collection. <pre class="prettyprint"><code>sh.enableSharding("testShard"); sh.shardCollection("testShard.shardedCollection", {_id:1}); </code></pre> Then I tried to fire 4,000,000 insert queries to mongos server. I execute script below 4 times. <pre class="prettyprint"><code>for(var i=0; i<1000000; i++){ db.shardedCollection.insert({x:i}); } </code></pre> Using _id as shard key, as per my understanding, 4000000 document as mentioned will fit in 1 shard and all insert will happens in A shard only. However, the result was not as I expected, it is ~1,3 million documents inserted in A shard, another ~2,7 million documents inserted in B shard. Why did it happen? Are something missing in shard coll setting commands? Or my understanding is wrong, maybe there is something like default range shard key in mongodb? It will be very helpful if someone can share the behavior of default range shard key (without tag aware). Below is sh.status() result <pre class="prettyprint"><code> shard key: { "_id" : 1 } chunks: B 5 A 5 { "_id" : { "$minKey" : 1 } } -->> { "_id" : ObjectId("540c703398c7efdea6037cbc") } on : B Timestamp(6, 0) { "_id" : ObjectId("540c703398c7efdea6037cbc") } -->> { "_id" : ObjectId("540c703498c7efdea603bfe3") } on : A Timestamp(6, 1) { "_id" : ObjectId("540c703498c7efdea603bfe3") } -->> { "_id" : ObjectId("540c704398c7efdea605d818") } on : A Timestamp(3, 0) { "_id" : ObjectId("540c704398c7efdea605d818") } -->> { "_id" : ObjectId("540c705298c7efdea607f04e") } on : A Timestamp(4, 0) { "_id" : ObjectId("540c705298c7efdea607f04e") } -->> { "_id" : ObjectId("540c707098c7efdea60c20ba") } on : B Timestamp(5, 1) { "_id" : ObjectId("540c707098c7efdea60c20ba") } -->> { "_id" : ObjectId("540c7144319c0dbee096f7d6") } on : B Timestamp(2, 4) { "_id" : ObjectId("540c7144319c0dbee096f7d6") } -->> { "_id" : ObjectId("540c7183319c0dbee09f58ad") } on : B Timestamp(2, 6) { "_id" : ObjectId("540c7183319c0dbee09f58ad") } -->> { "_id" : ObjectId("540eb15ddace5b39fbc32239") } on : B Timestamp(4, 2) { "_id" : ObjectId("540eb15ddace5b39fbc32239") } -->> { "_id" : ObjectId("540eb192dace5b39fbca8a84") } on : A Timestamp(5, 2) { "_id" : ObjectId("540eb192dace5b39fbca8a84") } -->> { "_id" : { "$maxKey" : 1 } } on : A Timestamp(5, 3) </code></pre>

As @LalitAgarwal already pointed out, ObjectIds make a bad shard key by default. However, if you don't really care on which shard your data lives and only want to have the write operations and the chunks distributed evenly among you shards, this is quite easy to acquire: <pre class="prettyprint"><code>db.shardedCollection.ensureIndex({_id:"hashed"}); sh.enableSharding("testShard"); sh.shardCollection("testShard.shardedCollection", {_id:"hashed"}); </code></pre> However, this comes with some (often negligible) drawbacks: <ol> <li>You have an additional index for the sole purpose of sharding and no other use cases</li> <li>This index will eat some RAM, a precious ressource on high load production nodes</li> <li>This artificial index will need write operations during insertions</li> </ol> A way better approach is to find a non-artificial shard key. Please read Considerations for Selecting Shard Keys for details. In short: <ol> <li>Find a field or combination of fields which unambiguously identify each document and (in combination) greatly differ from each other. Ideally, those should be fields you query for anyway.</li> <li>Use this field or combination of fields as your _id. Since an index is needed on the _id field anyway, and you query for those fields, you got rid of unneeded index(es).</li> <li>Use the selected _id field as your shard key.</li> </ol>

Default range shard key mongodb

Tags:

mongodb

sharding

I have a mongodb shard with 2 shards (let say A & B), 17GB free space each. I set _id which contains object ID as shard key.

Below are commands used to set db and collection.

sh.enableSharding("testShard");
sh.shardCollection("testShard.shardedCollection", {_id:1});

Then I tried to fire 4,000,000 insert queries to mongos server. I execute script below 4 times.

for(var i=0; i<1000000; i++){
  db.shardedCollection.insert({x:i});
}

Using _id as shard key, as per my understanding, 4000000 document as mentioned will fit in 1 shard and all insert will happens in A shard only.

However, the result was not as I expected, it is ~1,3 million documents inserted in A shard, another ~2,7 million documents inserted in B shard.

Why did it happen? Are something missing in shard coll setting commands? Or my understanding is wrong, maybe there is something like default range shard key in mongodb?

It will be very helpful if someone can share the behavior of default range shard key (without tag aware).

Below is sh.status() result

  shard key: { "_id" : 1 }
  chunks:
    B  5
    A  5
  { "_id" : { "$minKey" : 1 } } -->> { "_id" : ObjectId("540c703398c7efdea6037cbc") } on : B Timestamp(6, 0) 
  { "_id" : ObjectId("540c703398c7efdea6037cbc") } -->> { "_id" : ObjectId("540c703498c7efdea603bfe3") } on : A Timestamp(6, 1) 
  { "_id" : ObjectId("540c703498c7efdea603bfe3") } -->> { "_id" : ObjectId("540c704398c7efdea605d818") } on : A Timestamp(3, 0) 
  { "_id" : ObjectId("540c704398c7efdea605d818") } -->> { "_id" : ObjectId("540c705298c7efdea607f04e") } on : A Timestamp(4, 0) 
  { "_id" : ObjectId("540c705298c7efdea607f04e") } -->> { "_id" : ObjectId("540c707098c7efdea60c20ba") } on : B Timestamp(5, 1) 
  { "_id" : ObjectId("540c707098c7efdea60c20ba") } -->> { "_id" : ObjectId("540c7144319c0dbee096f7d6") } on : B Timestamp(2, 4) 
  { "_id" : ObjectId("540c7144319c0dbee096f7d6") } -->> { "_id" : ObjectId("540c7183319c0dbee09f58ad") } on : B Timestamp(2, 6) 
  { "_id" : ObjectId("540c7183319c0dbee09f58ad") } -->> { "_id" : ObjectId("540eb15ddace5b39fbc32239") } on : B Timestamp(4, 2) 
  { "_id" : ObjectId("540eb15ddace5b39fbc32239") } -->> { "_id" : ObjectId("540eb192dace5b39fbca8a84") } on : A Timestamp(5, 2) 
  { "_id" : ObjectId("540eb192dace5b39fbca8a84") } -->> { "_id" : { "$maxKey" : 1 } } on : A Timestamp(5, 3)

753

asked Sep 12 '14 03:09

rendybjunior

1 Answers

As @LalitAgarwal already pointed out, ObjectIds make a bad shard key by default. However, if you don't really care on which shard your data lives and only want to have the write operations and the chunks distributed evenly among you shards, this is quite easy to acquire:

db.shardedCollection.ensureIndex({_id:"hashed"});
sh.enableSharding("testShard");
sh.shardCollection("testShard.shardedCollection", {_id:"hashed"});

However, this comes with some (often negligible) drawbacks:

You have an additional index for the sole purpose of sharding and no other use cases
This index will eat some RAM, a precious ressource on high load production nodes
This artificial index will need write operations during insertions

A way better approach is to find a non-artificial shard key. Please read Considerations for Selecting Shard Keys for details. In short:

Find a field or combination of fields which unambiguously identify each document and (in combination) greatly differ from each other. Ideally, those should be fields you query for anyway.
Use this field or combination of fields as your _id. Since an index is needed on the _id field anyway, and you query for those fields, you got rid of unneeded index(es).
Use the selected _id field as your shard key.

181

answered Oct 01 '22 12:10

Markus W Mahlberg

Related questions
                            
                                Using as a graph database for finding "friends" of "friends" in MongoDb
                            
                                mongodb upsert in updating an array element
                            
                                MongoDB two member replica-set
                            
                                Spring Mongo Upsert nested document
                            
                                MongoDB Out of Memory
                            
                                how to define a compound and hashed mongodb index?
                            
                                MongoDB add array to BsonDocument
                            
                                Should I use the timestamp in "_id"?
                            
                                Setting default MongoDb/Bson representation for all decimals to double
                            
                                Node.js Mocha Unit Testing error re: Mongoose mocks with Mockgoose, "Error setting TTL index on collection : sessions"
                            
                                Time to learn Cassandra/ MongoDB [closed]
                            
                                Mongoengine: How to sort Embedded Document list by Embedded document field
                            
                                combining two $elemMatch queries with AND boolean operator in MongoDB
                            
                                mongodump with replica set with oplog throws error: "oplog mode is only supported on full dumps"
                            
                                Mongo DB object Id deserializing using JSON serializer
                            
                                Best way to create a mongo expression that never matches
                            
                                Update MongoDB collection with $toLower or $toUpper
                            
                                Mongoose - accessing nested object with .populate
                            
                                mongoDB. read, search timestamp based on oplog
                            
                                Without JOINs, what is the right way to handle data in document databases?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With