Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Default range shard key mongodb

I have a mongodb shard with 2 shards (let say A & B), 17GB free space each. I set _id which contains object ID as shard key.

Below are commands used to set db and collection.

sh.enableSharding("testShard");
sh.shardCollection("testShard.shardedCollection", {_id:1});

Then I tried to fire 4,000,000 insert queries to mongos server. I execute script below 4 times.

for(var i=0; i<1000000; i++){
  db.shardedCollection.insert({x:i});
}

Using _id as shard key, as per my understanding, 4000000 document as mentioned will fit in 1 shard and all insert will happens in A shard only.

However, the result was not as I expected, it is ~1,3 million documents inserted in A shard, another ~2,7 million documents inserted in B shard.

Why did it happen? Are something missing in shard coll setting commands? Or my understanding is wrong, maybe there is something like default range shard key in mongodb?

It will be very helpful if someone can share the behavior of default range shard key (without tag aware).

Below is sh.status() result

  shard key: { "_id" : 1 }
  chunks:
    B  5
    A  5
  { "_id" : { "$minKey" : 1 } } -->> { "_id" : ObjectId("540c703398c7efdea6037cbc") } on : B Timestamp(6, 0) 
  { "_id" : ObjectId("540c703398c7efdea6037cbc") } -->> { "_id" : ObjectId("540c703498c7efdea603bfe3") } on : A Timestamp(6, 1) 
  { "_id" : ObjectId("540c703498c7efdea603bfe3") } -->> { "_id" : ObjectId("540c704398c7efdea605d818") } on : A Timestamp(3, 0) 
  { "_id" : ObjectId("540c704398c7efdea605d818") } -->> { "_id" : ObjectId("540c705298c7efdea607f04e") } on : A Timestamp(4, 0) 
  { "_id" : ObjectId("540c705298c7efdea607f04e") } -->> { "_id" : ObjectId("540c707098c7efdea60c20ba") } on : B Timestamp(5, 1) 
  { "_id" : ObjectId("540c707098c7efdea60c20ba") } -->> { "_id" : ObjectId("540c7144319c0dbee096f7d6") } on : B Timestamp(2, 4) 
  { "_id" : ObjectId("540c7144319c0dbee096f7d6") } -->> { "_id" : ObjectId("540c7183319c0dbee09f58ad") } on : B Timestamp(2, 6) 
  { "_id" : ObjectId("540c7183319c0dbee09f58ad") } -->> { "_id" : ObjectId("540eb15ddace5b39fbc32239") } on : B Timestamp(4, 2) 
  { "_id" : ObjectId("540eb15ddace5b39fbc32239") } -->> { "_id" : ObjectId("540eb192dace5b39fbca8a84") } on : A Timestamp(5, 2) 
  { "_id" : ObjectId("540eb192dace5b39fbca8a84") } -->> { "_id" : { "$maxKey" : 1 } } on : A Timestamp(5, 3) 
like image 753
rendybjunior Avatar asked Sep 12 '14 03:09

rendybjunior


People also ask

What is shard key in MongoDB?

The shard key is either a single indexed field or multiple fields covered by a compound index that determines the distribution of the collection's documents among the cluster's shards.

What is a good shard key?

The distribution of data affects the efficiency and performance of operations within the sharded cluster. The ideal shard key allows MongoDB to distribute documents evenly throughout the cluster while also facilitating common query patterns. When you choose your shard key, consider: the cardinality of the shard key.

Does shard key have to be unique?

But it looks like, if we shard by user_id, which has been added to the second collection, this is not sufficient, the shard key needs to be unique, so every time we did a lookup, we'd be going across all shards. This is not optimal.


1 Answers

As @LalitAgarwal already pointed out, ObjectIds make a bad shard key by default. However, if you don't really care on which shard your data lives and only want to have the write operations and the chunks distributed evenly among you shards, this is quite easy to acquire:

db.shardedCollection.ensureIndex({_id:"hashed"});
sh.enableSharding("testShard");
sh.shardCollection("testShard.shardedCollection", {_id:"hashed"});

However, this comes with some (often negligible) drawbacks:

  1. You have an additional index for the sole purpose of sharding and no other use cases
  2. This index will eat some RAM, a precious ressource on high load production nodes
  3. This artificial index will need write operations during insertions

A way better approach is to find a non-artificial shard key. Please read Considerations for Selecting Shard Keys for details. In short:

  1. Find a field or combination of fields which unambiguously identify each document and (in combination) greatly differ from each other. Ideally, those should be fields you query for anyway.
  2. Use this field or combination of fields as your _id. Since an index is needed on the _id field anyway, and you query for those fields, you got rid of unneeded index(es).
  3. Use the selected _id field as your shard key.
like image 181
Markus W Mahlberg Avatar answered Oct 01 '22 12:10

Markus W Mahlberg