Why is query router storing all of the data in mongodb cluster?

Question

I am trying to set up mongodb cluster, I have got 1 config server, 1 query router and 2 mongod instance. Here is my script to set up the cluster

mongod --configsvr --port 27010 --dbpath ~/mongodb/data1
mongos -configdb localhost:27010 --port 27011 
mongod --port 27012 --dbpath ~/mongodb/data2 
mongod --port 27013 --dbpath ~/mongodb/data3 

sh.addShard("localhost:27012")
sh.addShard("localhost:27013")

sh.enableSharding("tags")
db.tweets.ensureIndex( { _id : "hashed" } )
sh.shardCollection("tags.tweets", { "_id": "hashed" } )

In order to insert the data, I am using this script

connection = pymongo.MongoClient("mongodb://localhost:27011")
db=connection.tags
tweets = db.tweets


def main(jsonfile):
    f = open(jsonfile)

    for line in f.readlines():
        try:
            tweet_dict = json.loads(line)
            result = tweets.insert_one(tweet_dict)
            print result.inserted_id
        except  Exception as e:
            print "Unexpected error:", type(e), e
            sys.exit()

Why my tweets, which I am trying to insert, are getting sharded, all of the tweets I am trying to insert are also getting stored in query router. Is this behaviour expected?

The whole point of cluster is horizontal scalability(i.e. tweets getting split among machine), so for all of the tweets to accumulate in query router seems counter-intuitive?

Can anybody explain why it is happening? Why query router has all of the tweets I have inserted?

William Byrne III · Accepted Answer

You ask why your inserted tweets "are also getting stored in query router". The short answer is that the only copy of each document is stored on one of the underlying shard servers, and nothing is stored on the query router. The mongos process is not started with a --dbpath parameter, so it has nowhere to store data.

I set up an environment just like yours and then I then used a python script similar to yours to connect to the mongos (aka query router) and insert 200 documents to tags.tweets. Now when I connect to the mongos and count the documents in tags.tweets, it finds 200.

$> mongo --port 27011 tags
mongos> db.tweets.count()
200

However, when I run getShardDistribution it shows docs 91 on the first shard, and docs 109 docs on the second:

mongos> db.tweets.getShardDistribution()

 Shard shard0000 at localhost:27301
  data : 18KiB docs : 91 chunks : 2
  estimated data per chunk : 9KiB
  estimated docs per chunk : 45

 Shard shard0001 at localhost:27302
  data : 22KiB docs : 109 chunks : 2
  estimated data per chunk : 11KiB
  estimated docs per chunk : 54

 Totals
  data : 41KiB docs : 200 chunks : 4
  Shard shard0000 contains 45.41% data, 45.5% docs in cluster, avg obj size on shard : 210B
  Shard shard0001 contains 54.58% data, 54.5% docs in cluster, avg obj size on shard : 211B

How the query router works is it passes all commands to the underlying shard servers and then combines their responses before returning a result the to the caller. The count() of 200 returned above was just the sum of a count() done on each shard.

There's a lot more information about using MongoDB sharding for horizontal scalability in the sharding documentation here. You might find the section on metadata helpful for your current issue.

Why is query router storing all of the data in mongodb cluster?

Tags:

mongodb

Dude

1 Answers

William Byrne III

Recent Activity

Donate For Us

Why is query router storing all of the data in mongodb cluster?

Tags:

mongodb

Dude

1 Answers

William Byrne III

Related questions

Recent Activity

Donate For Us