Like many others, I am thinking about the correct approach to sharding my collections in MongoDB. The main question is: how does auto-sharding actually work?
The official docs say: "MongoDB scales horizontally via an auto-sharding (partitioning) architecture" and "To partition a collection, we specify a shard key pattern", with the note "It is important to choose the right shard key for a collection" :).
http://www.mongodb.org/display/DOCS/Sharding+Introduction#ShardingIntroduction-ShardKeys
http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key
Now the question is: is this the right shard key (sharding by the autogenerated ObjectId)?
db.runCommand({ shardcollection : "test.mycollection", key : { _id : 1 } })
(note that shardcollection expects a full "db.collection" namespace, not a bare database name)
What happens internally in Mongo in this case? How will Mongo split the data into chunks? Assuming I initially have 10 million records on 2 shard servers: what happens on the Mongo side when I add 2 more shard servers once the collection reaches 20 million records? I could not find that level of detail in any Mongo-related sources.
Taking into account the random-looking nature of the autogenerated _id and its structure,
... http://www.mongodb.org/display/DOCS/Object+IDs ...
I would shard by the least significant byte (right-to-left order), with chunks split by the value of 2-3 bytes. This would provide an easy way to shard across 2^N shard servers (2, 4, 8, ..., 256) with a more-or-less even load on each shard and minimal required configuration. As far as I understand, Mongo only supports sharding/chunking by explicitly defined ranges, so my idea will not work. Is this true?
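For illustration, here is a minimal Python sketch of the 12-byte ObjectId layout described in the docs (4-byte timestamp, 3-byte machine id, 2-byte pid, 3-byte counter); the machine/pid/counter bytes are replaced with random bytes here, even though the real counter is monotonic per process:

```python
import os
import struct
import time

def fake_object_id():
    """Mimic the 12-byte ObjectId layout: 4-byte big-endian timestamp,
    followed by machine id (3B), pid (2B) and counter (3B), which are
    approximated with random bytes for this sketch."""
    ts = struct.pack(">I", int(time.time()))
    return ts + os.urandom(8)

oid = fake_object_id()
print(oid.hex())   # leading 8 hex chars encode the timestamp
print(oid[-1])     # trailing byte varies between documents
```

Two ids generated close together share their leading (timestamp) bytes but differ in their trailing bytes, which is exactly why the most significant bytes cluster writes while the least significant bytes look evenly distributed.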
It's generally not a good idea to use the default ObjectId as the shard key, since it has an embedded timestamp and increases monotonically over time. This may work fine if you do a lot of updates that touch old and new documents in an evenly distributed fashion. However, it is really bad news if your application is insert-heavy, since the majority of your writes will go to a single shard: the shard that owns the [nearCurrentTimestamp -> infinity] chunk.
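The hot-chunk effect described above can be simulated in a few lines of Python; integer keys stand in for time-ordered ObjectIds, and the chunk boundaries are made up:

```python
import bisect

# Chunk boundaries left behind by earlier splits: keys < 100, 100-199, >= 200.
split_points = [100, 200]
chunk_hits = [0, 0, 0]   # write counter per chunk

def route(key):
    """Send a write to the chunk whose range covers the key."""
    chunk_hits[bisect.bisect_right(split_points, key)] += 1

# Monotonically increasing inserts, like timestamp-prefixed ObjectIds:
for key in range(200, 300):
    route(key)

print(chunk_hits)   # -> [0, 0, 100]: every write hits the last chunk
```

Every new key is larger than all previous ones, so only the shard owning the upper-bound chunk does any insert work.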
Each mongos monitors write traffic to the shards and uses a very simple heuristic to determine whether a chunk has become too big and needs to be split (the threshold size is configurable via chunkSize).
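A toy model of that heuristic, assuming a made-up chunkSize and a fixed document size (the real mongos estimates chunk sizes and picks split points from the actual key distribution):

```python
import random

CHUNK_SIZE_BYTES = 1024
DOC_BYTES = 64
chunks = [{"min": 0, "max": float("inf"), "bytes": 0, "keys": []}]

def insert(key):
    """Route a write to the owning chunk; split it if it grew too big."""
    for c in chunks:
        if c["min"] <= key < c["max"]:
            c["keys"].append(key)
            c["bytes"] += DOC_BYTES
            if c["bytes"] > CHUNK_SIZE_BYTES:
                split(c)
            return

def split(c):
    """Split a chunk at its median key into two smaller ranges."""
    c["keys"].sort()
    mid = c["keys"][len(c["keys"]) // 2]
    if mid == c["min"]:   # cannot split on an existing bound
        return
    right = {"min": mid, "max": c["max"],
             "keys": [k for k in c["keys"] if k >= mid]}
    right["bytes"] = DOC_BYTES * len(right["keys"])
    c["max"] = mid
    c["keys"] = [k for k in c["keys"] if k < mid]
    c["bytes"] = DOC_BYTES * len(c["keys"])
    chunks.append(right)

random.seed(1)
for _ in range(100):
    insert(random.randrange(10_000))

print(len(chunks))   # several chunks after ~100 * 64B of writes
```

With uniformly distributed keys the splits fall roughly at key midpoints, which is what keeps ranges balanced without any fixed, pre-declared boundaries.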
When you add a new shard to the cluster, the balancer (http://www.mongodb.org/display/DOCS/Sharding+Administration#ShardingAdministration-Balancing) will see a chunk imbalance and will start migrating chunks to the new shards.
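The rebalancing that follows can be sketched as a loop that moves chunks from the most-loaded shard to the least-loaded one; the real balancer uses migration thresholds and moves one chunk at a time, so this is only a rough model:

```python
# Chunk counts before scaling out, plus two newly added empty shards.
shards = {"shard0": 10, "shard1": 10, "shard2": 0, "shard3": 0}

def balance(shards):
    """Migrate chunks one at a time until counts differ by at most 1."""
    moves = 0
    while True:
        hi = max(shards, key=shards.get)
        lo = min(shards, key=shards.get)
        if shards[hi] - shards[lo] <= 1:
            return moves
        shards[hi] -= 1   # migrate one chunk hi -> lo
        shards[lo] += 1
        moves += 1

moves = balance(shards)
print(moves, shards)   # -> 10 {'shard0': 5, 'shard1': 5, 'shard2': 5, 'shard3': 5}
```

This matches the scenario in the question: going from 2 shards to 4, roughly half the chunks migrate to the new servers, and no data needs to be re-keyed.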
Mongo supports range-based sharding; however, that does not mean the ranges are fixed, since chunks can be split into smaller ranges and moved around the cluster over time.