Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MongoDB: BIllions of documents in a collection

Tags:

mongodb

I need to load 6.6 billion bigrams into a collection but I can't find any information on the best way to do this.

Loading that many documents onto a single primary key index would take forever but as far as I'm aware mongo doesn't support the equivalent of partitioning?

Would sharding help? Should I try and split the data set over many collections and build that logic into my application?

like image 887
Elliot Chance Avatar asked Jul 04 '12 00:07

Elliot Chance


People also ask

How many documents can a collection have in MongoDB?

If you specify a maximum number of documents for a capped collection using the max parameter to create, the limit must be less than 2^32 documents. If you do not specify a maximum number of documents when creating a capped collection, there is no limit on the number of documents.

How does MongoDB define maximum number of documents in collection?

Unless you create a capped collection, there is no limit to the number of documents in a collection. There is a 16MB limit to the size of a document (use gridfs in this situation) and limits in the storage engine for the size of the database and data.

How many documents is too many in MongoDB?

To my knowledge, there's no real 'limit' on the number of docs in a collection.. probably, it is the number of unique combinations of _id field MongoDB can generate..But that would be much larger than 500K..

How does MongoDB calculate collection size?

collection. totalSize() method is used to reports the total size of a collection, including the size of all documents and all indexes on a collection. Returns: The total size in bytes of the data in the collection plus the size of every index on the collection.


2 Answers

It's hard to say what the optimal bulk insert is -- this partly depends on the size of the objects you're inserting and other immeasurable factors. You could try a few ranges and see what gives you the best performance. As an alternative, some people like using mongoimport, which is pretty fast, but your import data needs to be json or csv. There's obviously mongodrestore, if the data is in BSON format.

Mongo can easily handle billions of documents and can have billions of documents in the one collection but remember that the maximum document size is 16mb. There are many folk with billions of documents in MongoDB and there's lots of discussions about it on the MongoDB Google User Group. Here's a document on using a large number of collections that you may like to read, if you change your mind and want to have multiple collections instead. The more collections you have, the more indexes you will have also, which probably isn't what you want.

Here's a presentation from Craigslist on inserting billions of documents into MongoDB and the guy's blogpost.

It does look like sharding would be a good solution for you but typically sharding is used for scaling across multiple servers and a lot of folk do it because they want to scale their writes or they are unable to keep their working set (data and indexes) in RAM. It is perfectly reasonable to start off with a single server and then move to a shard or replica-set as your data grows or you need extra redundancy and resilience.

However, there are other users use multiple mongods to get around locking limits of a single mongod with lots of writes. It's obvious but still worth saying but a multi-mongod setup is more complex to manage than a single server. If your IO or cpu isn't maxed out here, your working set is smaller than RAM and your data is easy to keep balanced (pretty randomly distributed), you should see improvement (with sharding on a single server). As a FYI, there is potential for memory and IO contention. With 2.2 having improved concurrency with db locking, I suspect that there will be much less of a reason for such a deployment.

You need to plan your move to sharding properly, i.e. think carefully about choosing your shard key. If you go this way then it's best to pre-split and turn off the balancer. It will be counter-productive to be moving data around to keep things balanced which means you will need to decide up front how to split it. Additionally, it is sometimes important to design your documents with the idea that some field will be useful for sharding on, or as a primary key.

Here's some good links -

  • Choosing a Shard Key
  • Blog post on shard keys
  • Overview presentation on sharding
  • Presentation on Sharding Best Practices
like image 68
Mark Hillick Avatar answered Oct 04 '22 12:10

Mark Hillick


You can absolutely shard data in MongoDB (which partitions across N servers on the shard key). In fact, that's one of it's core strengths. There is no need to do that in your application.

For most use cases, I would strongly recommend doing that for 6.6 billion documents. In my experience, MongoDB performs better with a number of mid-range servers rather than one large one.

like image 29
Eric J. Avatar answered Oct 04 '22 12:10

Eric J.