
Sharding by ObjectID, is it the right way?

Tags:

mongodb

Like many others, I'm trying to work out the correct approach to sharding my collections in Mongo. The main question is: how does auto-sharding work?

The official docs say "MongoDB scales horizontally via an auto-sharding (partitioning) architecture" and "To partition a collection, we specify a shard key pattern", with the note "It is important to choose the right shard key for a collection" :).
http://www.mongodb.org/display/DOCS/Sharding+Introduction#ShardingIntroduction-ShardKeys
http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key

Now the question is: is sharding by ObjectID the "right key"?

db.runCommand({ shardcollection : "test", key : { _id : 1 }})

What happens internally in Mongo in this case, and how will it split the data into chunks? Assuming I initially have 10 million records on 2 shard servers, what happens on the Mongo side when I add 2 more shard servers once the collection reaches 20 million records? I could not find that level of detail in any Mongo-related source.

Taking into account the random nature of the autogenerated _id and its structure,

... http://www.mongodb.org/display/DOCS/Object+IDs ...

I would shard by the least significant byte (right-to-left order), with chunks split by the value of 2-3 bytes. This would provide an easy way to shard across 2^N shard servers (2, 4, 8, ..., 256) with a more or less even load on each shard and minimal configuration. As far as I understand, Mongo only supports sharding/chunking by explicitly defined ranges, so my idea will not work. Is that true?
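To make the idea concrete, here is a minimal sketch in plain JavaScript of mapping an _id to one of 2^N shards by its least significant byte. The helper name `shardForId` and the example id are hypothetical and only illustrate the proposal; this is not something Mongo exposes.

```javascript
// Hypothetical helper (not a MongoDB API): pick a shard index from the
// least significant byte of a 24-character ObjectId hex string.
function shardForId(hexId, shardCount) {
  if (!/^[0-9a-f]{24}$/i.test(hexId)) {
    throw new Error("expected a 24-character hex ObjectId string");
  }
  // Only powers of two from 1 to 256 fit into a single byte evenly.
  const isPowerOfTwo = (shardCount & (shardCount - 1)) === 0;
  if (shardCount < 1 || shardCount > 256 || !isPowerOfTwo) {
    throw new Error("shardCount must be a power of two between 1 and 256");
  }
  const lastByte = parseInt(hexId.slice(-2), 16);
  // For a power of two, masking the low bits equals lastByte % shardCount.
  return lastByte & (shardCount - 1);
}

// Made-up id whose last byte is 0xa7.
const exampleId = "4f301234" + "0".repeat(14) + "a7";
console.log(shardForId(exampleId, 4));
```

Going from 2 to 4 shards under this scheme only changes which bits of the byte are used, so roughly half of the documents on each old shard would have to move.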

Xtra Coder asked Feb 06 '12 17:02





1 Answer

It's generally not a good idea to use the default ObjectId as the shard key, since it has an embedded timestamp and increases monotonically over time. This may work fine if your workload is mostly updates that touch old and new documents in an evenly distributed fashion. However, it is really bad news if your application is insert-heavy, since the majority of your writes will go to a single shard: the one that owns the [nearCurrentTimestamp -> infinity] chunk.
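The hotspot can be seen in a small stand-alone simulation. This is a sketch in plain JavaScript, not Mongo code: the chunk boundaries, shard names, and the `fakeObjectId` helper are made up for illustration, and the only fact it relies on is that the first 4 bytes of an ObjectId are a timestamp in seconds.

```javascript
// Read the creation time (seconds since the Unix epoch) from the first
// 4 bytes (8 hex characters) of an ObjectId hex string.
function objectIdTimestamp(hexId) {
  return parseInt(hexId.slice(0, 8), 16);
}

// Toy range chunks on _id, each owning [min, max). Boundaries are invented.
const chunks = [
  { shard: "shard0", min: "0".repeat(24), max: "4f".padEnd(24, "0") },
  { shard: "shard1", min: "4f".padEnd(24, "0"), max: "4f30".padEnd(24, "0") },
  { shard: "shard1", min: "4f30".padEnd(24, "0"), max: null }, // null = +infinity
];

// Equal-length lowercase hex strings sort in the same order as their bytes.
function chunkFor(hexId) {
  return chunks.find(c => hexId >= c.min && (c.max === null || hexId < c.max));
}

// Build a fake id: 4-byte timestamp, 5 zero bytes standing in for the
// machine/process part, then a 3-byte counter.
function fakeObjectId(seconds, counter) {
  return seconds.toString(16).padStart(8, "0") +
         "0".repeat(10) +
         counter.toString(16).padStart(6, "0");
}

// Simulate 1000 inserts, one per second, starting after the last boundary.
const hits = {};
for (let i = 0; i < 1000; i++) {
  const id = fakeObjectId(0x4f301000 + i, i);
  const shard = chunkFor(id).shard;
  hits[shard] = (hits[shard] || 0) + 1;
}
console.log(hits);
```

Every new id is larger than all existing chunk boundaries, so all of the simulated inserts land in the open-ended last chunk and therefore on one shard.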

Each mongos monitors write traffic to the shards and uses a very simple heuristic to determine whether a chunk has become too big and needs to be split (the threshold size is configurable via chunkSize).
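As a rough mental model only, splitting can be pictured like the toy below. It is a plain JavaScript sketch and not the heuristic mongos actually uses: the byte threshold `CHUNK_SIZE_BYTES`, the split at the median key, and the proportional size estimate are all assumptions made for illustration.

```javascript
// Hypothetical threshold standing in for the configured chunk size.
const CHUNK_SIZE_BYTES = 1024;

// Toy chunk owning the key range [min, max); max === null means +infinity.
function makeChunk(min, max) {
  return { min, max, keys: [], bytes: 0 };
}

// Insert a key into the chunk that owns it and split that chunk at its
// median key once it grows past the threshold. Assumes unique keys, as
// with _id.
function insert(chunks, key, docBytes) {
  const i = chunks.findIndex(c => key >= c.min && (c.max === null || key < c.max));
  const chunk = chunks[i];
  chunk.keys.push(key);
  chunk.keys.sort();
  chunk.bytes += docBytes;
  if (chunk.bytes <= CHUNK_SIZE_BYTES || chunk.keys.length < 2) return chunks;

  const mid = Math.floor(chunk.keys.length / 2);
  const splitKey = chunk.keys[mid];
  const left = makeChunk(chunk.min, splitKey);
  const right = makeChunk(splitKey, chunk.max);
  left.keys = chunk.keys.slice(0, mid);
  right.keys = chunk.keys.slice(mid);
  // Approximation: divide the byte count in proportion to the key counts.
  left.bytes = Math.round(chunk.bytes * left.keys.length / chunk.keys.length);
  right.bytes = chunk.bytes - left.bytes;
  chunks.splice(i, 1, left, right);
  return chunks;
}

// Start with one chunk covering everything and insert 20 increasing keys
// of 100 bytes each.
const toyChunks = [makeChunk("", null)];
for (let n = 0; n < 20; n++) {
  insert(toyChunks, "k" + String(n).padStart(2, "0"), 100);
}
console.log(toyChunks.map(c => [c.min, c.max]));
```

With monotonically increasing keys, every split in this toy happens in the last chunk, which matches the insert hotspot described above.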

When you add a new shard to the cluster, the balancer (http://www.mongodb.org/display/DOCS/Sharding+Administration#ShardingAdministration-Balancing) will see a chunk imbalance and will start migrating chunks to the new shard.
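For the 2-shards-to-4-shards case from the question, the effect of balancing can be sketched with a toy model in plain JavaScript. The `balance` function, its threshold of 1, and the choice of which chunk to move are assumptions for illustration; the real balancer has its own thresholds and migration rules.

```javascript
// Toy balancer: move one chunk at a time from the shard with the most
// chunks to the shard with the fewest, until the counts differ by at most
// `threshold`. Returns the list of moves it made.
function balance(shards, threshold = 1) {
  const moves = [];
  for (;;) {
    const names = Object.keys(shards);
    names.sort((a, b) => shards[a].length - shards[b].length);
    const least = names[0];
    const most = names[names.length - 1];
    if (shards[most].length - shards[least].length <= threshold) break;
    const chunk = shards[most].pop();
    shards[least].push(chunk);
    moves.push({ chunk, from: most, to: least });
  }
  return moves;
}

// Two shards with four chunks each, then two empty shards are added.
const cluster = {
  shard0: ["c0", "c1", "c2", "c3"],
  shard1: ["c4", "c5", "c6", "c7"],
};
cluster.shard2 = [];
cluster.shard3 = [];

const moves = balance(cluster);
console.log(moves.length, cluster);
```

Only whole chunks move, so existing documents are not rehashed or rewritten; the new shards simply take over some of the existing key ranges.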

Mongo supports range-based sharding. However, that does not mean the ranges are fixed: chunks can be split into smaller ranges and moved around the cluster over time.

Ren answered Oct 23 '22 11:10
