Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Mongo sharding fails to split large collection between shards

I'm having problems with what seems to be a simple sharding setup in mongo.

I have two shards, a single mongos instance, and a single config server set up like this:

Machine A - 10.0.44.16 - config server, mongos
Machine B - 10.0.44.10 - shard 1
Machine C - 10.0.44.11 - shard 2

I have a collection called 'Seeds' that has a shard key 'SeedType' which is a field that is present on every document in the collection, and contains one of four values (take a look at the sharding status below). Two of the values have significantly more entries than the other two (two of them have 784,000 records each, and two have about 5,000).

The behavior I'm expecting to see is that records in the 'Seeds' collection with InventoryPOS will end up on one shard, and the ones with InventoryOnHand will end up on the other.

However, it seems that all records for both the two larger shard keys end up on the primary shard.

Here's my sharding status text (other collections removed for clarity):

--- Sharding Status ---
  sharding version: { "_id" : 1, "version" : 3 }
  shards:
      { "_id" : "shard0000", "host" : "10.44.0.11:27019" }
      { "_id" : "shard0001", "host" : "10.44.0.10:27017" }
  databases:
        { "_id" : "admin", "partitioned" : false, "primary" : "config" }
        { "_id" : "TimMulti", "partitioned" : true, "primary" : "shard0001" }
                TimMulti.Seeds chunks:
                        { "SeedType" : { $minKey : 1 } } -->> { "SeedType" : "PBI.AnalyticsServer.KPI" } on : shard0000 { "t" : 2000, "i" : 0 }
                        { "SeedType" : "PBI.AnalyticsServer.KPI" } -->> { "SeedType" : "PBI.Retail.InventoryOnHand" } on : shard0001 { "t" : 2000, "i" : 7 }
                        { "SeedType" : "PBI.Retail.InventoryOnHand" } -->> { "SeedType" : "PBI.Retail.InventoryPOS" } on : shard0001 { "t" : 2000, "i" : 8 }
                        { "SeedType" : "PBI.Retail.InventoryPOS" } -->> { "SeedType" : "PBI.Retail.SKU" } on : shard0001 { "t" : 2000, "i" : 9 }
                        { "SeedType" : "PBI.Retail.SKU" } -->> { "SeedType" : { $maxKey : 1 } } on : shard0001 { "t" : 2000, "i" : 10 }

Am I doing anything wrong?

Semi-unrelated question:

What is the best way to atomically transfer an object from one collection to another without blocking the entire mongo service?

Thanks in advance, -Tim

like image 597
Tim Avatar asked Sep 10 '10 00:09

Tim


People also ask

Is sharding better than replication?

What is the difference between replication and sharding? Replication: The primary server node copies data onto secondary server nodes. This can help increase data availability and act as a backup, in case if the primary server fails. Sharding: Handles horizontal scaling across servers using a shard key.

Is sharding possible in MongoDB?

Sharding is a method for distributing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations. Database systems with large data sets or high throughput applications can challenge the capacity of a single server.

Does sharding require a replica set?

Every shard server must always be deployed as a replica set. There must be a minimum of one shard in a sharded cluster, but to gain any benefits from sharding you will need at least two. The cluster's config server is a MongoDB instance that stores metadata and configuration settings for the sharded cluster.

How does sharding improve performance in MongoDB?

Sharding is MongoDB's way of supporting horizontal scaling. When you shard a MongoDB collection, the data is split across multiple server instances. This way, the same node is not queried in succession. The data is split on a particular field in the collection you've selected.


1 Answers

Sharding really isn't meant to be used this way. You should choose a shard key with some variation (or make a compound shard key) so that MongoDB can make reasonable-size chunks. One of the points of sharding is that your application doesn't have to know where your data is.

If you want to manually shard, you should do that: start unlinked MongoDB servers and route things yourself from the client side.

Finally, if you're really dedicated to this setup, you could migrate the chunk yourself (there's a moveChunk command).

The balancer moves chunks based on how much is mapped in memory (run serverStatus and look at the "mapped" field). It can take a while, MongoDB doesn't want your data flying all over the place in production, so it's pretty conservative.

Semi-unrelated answer: you can't do it atomically with sharding (eval isn't atomic across multiple servers). You'll have to do a findOne, insert, remove.

like image 152
kristina Avatar answered Oct 23 '22 04:10

kristina