 

What is the max size of a collection in MongoDB?

I would like to know the maximum size of a collection in MongoDB. The MongoDB limitations documentation mentions that a single MMAPv1 database has a maximum size of 32TB.

Does this mean the maximum size of a collection is 32TB? If I want to store more than 32TB in one collection, what is the solution?

asked Nov 26 '15 by Aravind Kumar Anugula

People also ask

How many items can be in a MongoDB collection?

There is no fixed limit on the number of documents in a regular collection. If you specify a maximum number of documents for a capped collection using the max parameter of create, that limit must be less than 2^32 documents.
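For illustration, here is a minimal mongo-shell sketch of creating a capped collection with an explicit document cap; the collection name "events" and the sizes are placeholder values, not anything taken from the question:

db.createCollection("events", {
  capped: true,             // capped collections require an explicit size
  size: 10 * 1024 * 1024,   // maximum size in bytes
  max: 1000000              // maximum number of documents; must be below 2^32
});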

How do I increase the size of a collection in MongoDB?

Running db.runCommand({ convertToCapped: "events", size: 10 * 1024 * 1024 }); will set the byte size of the capped collection (larger or smaller).

Can MongoDB handle millions of data?

Working with MongoDB and Elasticsearch together is a sound choice for processing millions of records in real time. The same structures and concepts apply to larger datasets and work extremely well there too.

How do I find the size of a collection in MongoDB?

The db.collection.totalSize() method reports the total size of a collection, including the size of all documents and all indexes on the collection. It returns the total size in bytes of the data in the collection plus the size of every index on the collection.
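As a hedged example, assuming a collection named "events", the related sizing helpers in the mongo shell look like this:

db.events.totalSize();    // bytes of data plus all indexes
db.events.dataSize();     // bytes of the documents only
db.events.storageSize();  // bytes of storage allocated to the collection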


1 Answer

There are theoretical limits, as I will show below, but even the lower bound is pretty high. It is not easy to calculate the limits correctly, but the order of magnitude should be sufficient.

MMAPv1

The actual limit depends on a few things, such as the length of shard names and the like (which adds up if you have a couple of hundred thousand of them), but here is a rough calculation with real-life data.

Each shard needs some space in the config database, which, like any other database, is limited to 32TB on a single machine or in a replica set. On the servers I administer, the average size of an entry in config.shards is 112 bytes. Furthermore, each chunk needs about 250 bytes of metadata. Let us assume an optimal chunk size of close to 64MB.

With 32TB per shard and 64MB chunks, we can have at most 500,000 chunks per shard. 500,000 * 250 bytes equals 125MB of chunk information per shard. Adding the 112-byte config.shards entry, we end up with about 125.000112MB of config data per shard if we max everything out. Dividing 32TB by that value shows that we can have a maximum of slightly under 256,000 shards in a cluster.

Each shard in turn can hold 32TB worth of data. 256,000 * 32TB is 8.192 exabytes, or 8,192,000 terabytes. That would be the limit for our example.
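To make the arithmetic easy to check, here is a back-of-the-envelope version of the same calculation in plain shell JavaScript; all numbers are the rough estimates used above, not measured values:

var chunkSizeMB = 64;                                    // assumed optimal chunk size
var chunksPerShard = 32 * 1000 * 1000 / chunkSizeMB;     // 32TB per shard -> 500,000 chunks
var configMBPerShard = chunksPerShard * 250 / 1e6        // chunk metadata, ~125MB
                     + 112 / 1e6;                        // config.shards entry
var maxShards = 32 * 1000 * 1000 / configMBPerShard;     // the config db itself is capped at 32TB
var maxDataTB = maxShards * 32;                          // each shard holds 32TB
// maxShards comes out slightly under 256,000 and maxDataTB at roughly 8,192,000TB (about 8 exabytes)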

Let's say it's 8 exabytes. As of now, this easily translates to "enough for all practical purposes". To give you an impression: all the data held by the Library of Congress (arguably one of the biggest libraries in the world in terms of collection size) is estimated at around 20TB, including audio, video, and digital materials. You could fit that into our theoretical MongoDB cluster some 400,000 times. Note that this is the lower bound of the maximum size, using conservative values.

WiredTiger

Now for the good part: the WiredTiger storage engine does not have this limitation. The database size is not limited (since there is no limit on how many data files can be used), so we can have an unlimited number of shards. Even when those shards run on MMAPv1 and only our config servers run on WiredTiger, the size of the cluster becomes nearly unlimited. At some point, the 16.8M TB limit on RAM in a 64-bit system might cause problems and force the indices of the config.shards collection to be swapped to disk, stalling the system. I can only guess, since my calculator refuses to work with numbers in that range (and I am too lazy to do it by hand), but I estimate the limit here to be in the two-digit yottabyte range (and the space needed to host it somewhere in the size of Texas).
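If you want to know which of the two limits applies to a given node, a minimal mongo-shell check of the storage engine in use (assuming you are connected to the node in question):

db.serverStatus().storageEngine.name;   // "wiredTiger" or "mmapv1"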

Conclusion

Do not worry about the maximum data size in a sharded environment. No matter what, it is by far enough, even with the most conservative approach. Use sharding, and you are done. By the way: even 32TB is a hell of a lot of data; most clusters I know hold less data and shard because IOPS and RAM utilization exceed a single node's capacity.
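As a sketch of the "use sharding" advice, assuming a connection to a mongos and an illustrative database mydb with a collection events sharded on a hashed _id key:

sh.enableSharding("mydb");
sh.shardCollection("mydb.events", { _id: "hashed" });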

answered Oct 11 '22 by Markus W Mahlberg