I am trying to set up mongodb cluster, I have got 1 config server, 1 query router and 2 mongod instance. Here is my script to set up the cluster
mongod --configsvr --port 27010 --dbpath ~/mongodb/data1
mongos -configdb localhost:27010 --port 27011
mongod --port 27012 --dbpath ~/mongodb/data2
mongod --port 27013 --dbpath ~/mongodb/data3
sh.addShard("localhost:27012")
sh.addShard("localhost:27013")
sh.enableSharding("tags")
db.tweets.ensureIndex( { _id : "hashed" } )
sh.shardCollection("tags.tweets", { "_id": "hashed" } )
In order to insert the data, I am using this script
connection = pymongo.MongoClient("mongodb://localhost:27011")
db=connection.tags
tweets = db.tweets
def main(jsonfile):
f = open(jsonfile)
for line in f.readlines():
try:
tweet_dict = json.loads(line)
result = tweets.insert_one(tweet_dict)
print result.inserted_id
except Exception as e:
print "Unexpected error:", type(e), e
sys.exit()
Why my tweets, which I am trying to insert, are getting sharded, all of the tweets I am trying to insert are also getting stored in query router. Is this behaviour expected?
The whole point of cluster is horizontal scalability(i.e. tweets getting split among machine), so for all of the tweets to accumulate in query router seems counter-intuitive?
Can anybody explain why it is happening? Why query router has all of the tweets I have inserted?
You ask why your inserted tweets "are also getting stored in query router". The short answer is that the only copy of each document is stored on one of the underlying shard servers, and nothing is stored on the query router. The mongos process is not started with a --dbpath parameter, so it has nowhere to store data.
I set up an environment just like yours and then I then used a python script similar to yours to connect to the mongos (aka query router) and insert 200 documents to tags.tweets. Now when I connect to the mongos and count the documents in tags.tweets, it finds 200.
$> mongo --port 27011 tags
mongos> db.tweets.count()
200
However, when I run getShardDistribution it shows docs 91 on the first shard, and docs 109 docs on the second:
mongos> db.tweets.getShardDistribution()
Shard shard0000 at localhost:27301
data : 18KiB docs : 91 chunks : 2
estimated data per chunk : 9KiB
estimated docs per chunk : 45
Shard shard0001 at localhost:27302
data : 22KiB docs : 109 chunks : 2
estimated data per chunk : 11KiB
estimated docs per chunk : 54
Totals
data : 41KiB docs : 200 chunks : 4
Shard shard0000 contains 45.41% data, 45.5% docs in cluster, avg obj size on shard : 210B
Shard shard0001 contains 54.58% data, 54.5% docs in cluster, avg obj size on shard : 211B
How the query router works is it passes all commands to the underlying shard servers and then combines their responses before returning a result the to the caller. The count() of 200 returned above was just the sum of a count() done on each shard.
There's a lot more information about using MongoDB sharding for horizontal scalability in the sharding documentation here. You might find the section on metadata helpful for your current issue.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With