MongoDb Database vs Collection

Tags:

mongodb

I am designing a system with MongoDb (64 bit version) to handle a large amount of users (around 100,000) and each user will have large amounts of data (around 1 million records).

What is the best design strategy?

  1. Dump all records in a single collection

  2. Have a collection for each user

  3. Have a database for each user.

Many Thanks,

Asked Dec 18 '12 by DafaDil

People also ask

What is the difference between database and collection in MongoDB?

Instead of tables, a MongoDB database stores its data in collections. A collection holds one or more BSON documents. Documents are analogous to records or rows in a relational database table. Each document has one or more fields; fields are similar to the columns in a relational database table.

What is the difference between a database and a collection?

A database contains collections, a collection contains documents, and each document contains data; they nest inside one another.

Is collection a database in MongoDB?

A collection in MongoDB is a group of documents. Collections in a NoSQL database like MongoDB correspond to tables in relational database management systems (RDBMS) or SQL databases. As such, collections don't enforce a set schema, and documents within a single collection can have widely different fields.

What is a collection in MongoDB?

A collection is a grouping of MongoDB documents. Documents within a collection can have different fields. A collection is the equivalent of a table in a relational database system. A collection exists within a single database.
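As a minimal illustration of that hierarchy in the mongo shell (the database name mydb, the collection people, and all field values are made up for this sketch):

    use mydb                                  // switch to (and lazily create) a database
    // the "people" collection is created implicitly on first insert
    db.people.insertOne({ name: "Ada", born: 1815 })
    // a document with completely different fields is still valid in the same collection
    db.people.insertOne({ name: "Alan", languages: ["Lisp", "C"] })
    show collections                          // lists "people" inside the current database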


2 Answers

So you're looking at somewhere in the region of 100 billion records (1 million records * 100,000 users).

The preferred way to deal with this much data is to create a sharded cluster that splits the data out over several servers, presented to clients as a single logical unit via the mongos query router.

Therefore the answer to your question is option 1: put all your records in a single sharded collection.
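As a sketch of what that looks like with the standard mongos admin helpers (the database name app, the collection name records, and the shard key userId are assumptions, not something given in the question):

    // run against a mongos router, not a plain mongod
    sh.enableSharding("app")                          // allow collections in "app" to be sharded
    // the shard key must be chosen up front; it cannot easily be changed later
    sh.shardCollection("app.records", { userId: 1 })  // one big collection, partitioned by user
    sh.status()                                       // inspect how chunks are spread over shards

From the application's point of view it is still one collection; the cluster takes care of distributing the chunks across the shards.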

The number of shards required and configuration of the cluster is related to the size of the data and other factors such as the quantity and distribution of reads and writes. The answers to those questions are probably very specific to your unique situation, so I won't attempt to guess them.

I'd probably start by deciding how many shards you have the time and machines to set up, and testing the system on a cluster of that many machines. Based on the performance of that, you can decide whether you need more or fewer shards in your cluster.

Answered Sep 29 '22 by chrisbunney

So you are looking at around 100,000,000,000 (100 billion) detail records overall for 100K users?

What many people don't seem to understand is that MongoDB is good at horizontal scaling, and horizontal scaling normally means scaling huge single collections of data across many (many) servers in a huge cluster.

So if you simply use a single collection for each kind of common data (i.e. one collection called user and one called detail, as sketched below), you are already playing to MongoDB's core purpose and design.
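A hedged sketch of that two-collection layout (the field names are illustrative only, not taken from the question):

    db.user.insertOne({ _id: 42, name: "alice" })
    // every detail record points back at its owner via userId
    db.detail.insertOne({ userId: 42, recordedAt: new Date(), reading: 3.14 })
    // index the field you will shard and query on, so per-user lookups stay cheap
    db.detail.createIndex({ userId: 1 })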

MongoDB, as mentioned by others, is not so good at scaling vertically across many collections. It has an nssize limit to begin with, and even though around 12K collections is the estimated ceiling, in reality, due to index size, you can have as few as 5K collections in your database.

So a collection per user is not feasible at all. It would be using MongoDB against its core principles.

Having a database per user involves the same problems as having a collection per user, and possibly more.

I have never encountered anyone unable to scale MongoDB into the billions on an optimised set-up, and while I have not seen it taken close to the 100s of billions (or beyond), I do not see why it cannot be; after all, Facebook is able to make MySQL scale into the 100s of billions of records (across 32K+ shards), and the sharding concept is similar between the two databases.

So the theory and the possibility of doing this are there. It is all about choosing the right schema, shard concept and shard key (and servers, network, etc.).

If you were to run into problems, you could split archived or deleted items away from the main collection, but I think that is overkill. Instead, you want to make sure that MongoDB knows where each segment of your huge dataset is at any given point in time, and ensure that this data is always hot; that way, queries that don't have to do a global scatter-gather operation should be quite fast.
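To illustrate the difference (assuming, as in the sketch above, a shard key that starts with userId):

    // includes the shard key, so mongos routes it to a single shard (targeted)
    db.detail.find({ userId: 42, recordedAt: { $gte: ISODate("2012-12-01") } })
    // omits the shard key, so mongos must ask every shard (scatter-gather)
    db.detail.find({ recordedAt: { $gte: ISODate("2012-12-01") } })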

Answered Sep 29 '22 by Sammaye