MongoDB - single huge collection of raw data. Split or not?

Tags:

mongodb

We collect and store instrumentation data from a large number of hosts. Our storage is MongoDB: several shards with replicas. Everything is stored in a single large collection. Each document we insert is a time-based observation with some attributes (measurements). The timestamp is the most important attribute because every query filters on time, at the very least. Documents are never updated, so it's a pure insert-and-look-up model. Right now it works reasonably well with several billion docs.

Now we want to grow a bit and hold up to 12 months of data, which may amount to a scary trillion+ observations (documents). I was wondering whether dumping everything into a single monstrous collection is the best choice or whether there is a more intelligent way to go about it. By more intelligent I mean: use less hardware while still providing fast inserts and (importantly) fast queries. So I thought about splitting the large collection into smaller pieces, hoping to gain on index memory, insertion speed and query speed.

I looked into shard keys, but sharding by the timestamp sounds like a bad idea because all writes will go to one node, canceling the benefits of sharding. The insert rates are pretty high, so we need sharding to work properly here. I also thought about creating a new collection every month and then picking the relevant collection(s) for a user query. Collections older than 12 months would be either dropped or archived. There is also the option of creating an entirely new database every month and doing a similar rotation. Other options? Or perhaps one large collection is THE option to grow really big?
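To make the monthly-collection idea concrete, here is a minimal sketch of how a query's time range could be mapped to collection names. The `observations_YYYY_MM` naming scheme and the `month_collections` helper are assumptions made up for illustration, not something we have in production.

```python
from datetime import datetime


def month_collections(start, end, prefix="observations_"):
    """Return the names of the monthly collections that overlap [start, end].

    The naming scheme (prefix + YYYY_MM) is a hypothetical convention.
    """
    if start > end:
        raise ValueError("start must not be after end")
    names = []
    year, month = start.year, start.month
    # Walk month by month until we pass the month that contains `end`.
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}{year:04d}_{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


if __name__ == "__main__":
    # A query spanning a year boundary touches four monthly collections.
    print(month_collections(datetime(2012, 11, 15), datetime(2013, 2, 3)))
```

The application would then run the same time-filtered query against each returned collection and merge the results, which is the extra cost of this approach compared with a single collection.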

Please share your experience and considerations in similar apps.

566

asked Apr 04 '13 16:04

Dima

2 Answers

It really depends on the use-case for your queries.

If the data can be aggregated, I would do that through a scheduled map/reduce job and store the smaller, aggregated results in separate collection(s).
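As a rough sketch of such a scheduled rollup, the function below builds an aggregation pipeline (swapped in here for map/reduce) that summarizes raw observations per host and hour and writes them to a smaller collection. The field names `ts`, `host` and `value` and the target collection `observations_hourly` are assumptions, and `$dateTrunc` and `$merge` need a reasonably recent MongoDB (5.0+ and 4.2+ respectively).

```python
from datetime import datetime


def hourly_rollup_pipeline(start, end, target="observations_hourly"):
    """Build a pipeline that rolls raw observations in [start, end) up per host and hour.

    Field names (ts, host, value) and the target collection are hypothetical.
    """
    return [
        # Only touch the time window this scheduled run is responsible for.
        {"$match": {"ts": {"$gte": start, "$lt": end}}},
        # One output document per (host, hour) with a few summary statistics.
        {"$group": {
            "_id": {
                "host": "$host",
                "hour": {"$dateTrunc": {"date": "$ts", "unit": "hour"}},
            },
            "count": {"$sum": 1},
            "avg_value": {"$avg": "$value"},
            "max_value": {"$max": "$value"},
        }},
        # Upsert into the rollup collection so that re-running a window is safe.
        {"$merge": {
            "into": target,
            "whenMatched": "replace",
            "whenNotMatched": "insert",
        }},
    ]


if __name__ == "__main__":
    pipeline = hourly_rollup_pipeline(datetime(2013, 4, 4), datetime(2013, 4, 5))
    # With pymongo this would be run as: db.observations.aggregate(pipeline)
    print([next(iter(stage)) for stage in pipeline])
```

Queries over long time ranges could then read the hourly rollup instead of the raw collection, if that level of detail is enough for them.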

If everything has to stay in the same collection and all the data has to be queried together to generate the desired results, then you need to go with sharding. Then, depending on how much data your queries touch, you could use an in-memory map/reduce or even do the processing at the application layer.

As you pointed out yourself, sharding on time alone is a very bad idea: it sends all the writes to one shard. So choose your shard key carefully, so that it spreads writes across the shards. The MongoDB docs have a very good explanation of this.
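To illustrate why a time-only key concentrates writes, here is a toy simulation. It is not MongoDB's real chunk balancing or hash function; the chunk boundaries, the host names and the idea of hashing a host identifier are assumptions chosen only to show the effect.

```python
import hashlib
from collections import Counter


def range_shard(ts, boundaries):
    """Return the index of the range chunk holding ts (boundaries are sorted upper bounds)."""
    for i, upper in enumerate(boundaries):
        if ts < upper:
            return i
    return len(boundaries)  # everything past the last split point


def hashed_shard(key, n_shards):
    """Toy hash placement: a stand-in for a hashed shard key, not MongoDB's algorithm."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def simulate(n_shards=4, n_hosts=1000, now=1_365_091_440):
    # Existing split points all lie in the past, as they would for a time-based key.
    boundaries = [now - 3000, now - 2000, now - 1000]
    # New observations carry timestamps at or after `now`.
    by_time = Counter(range_shard(now + i, boundaries) for i in range(n_hosts))
    # The same observations placed by a hash of a (hypothetical) host id.
    by_host = Counter(hashed_shard(f"host-{i}", n_shards) for i in range(n_hosts))
    return by_time, by_host


if __name__ == "__main__":
    by_time, by_host = simulate()
    print("range on time:", dict(by_time))
    print("hash on host: ", dict(by_host))
```

In this model every new write lands in the last time range, while the hashed placement uses all shards. A key that leads with something like the host and also includes the timestamp is one option to evaluate against your query patterns.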

If you can elaborate on what your queries specifically need, it would be easier to suggest something.

Hope it helps.

107

answered Oct 01 '22 13:10

Majid

I think monthly collections will give you some boost, but I was wondering why you can't use the hour part of your timestamp for sharding. You could add a field that holds the HOUR part of the timestamp; when you shard on it, the data should be distributed nicely because the same hours repeat every day. I have not tested it, but I thought it might help you.
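A minimal sketch of deriving such a field, assuming each document has a `ts` datetime (the field names and the `with_hour` helper are made up for illustration). It also shows two things worth checking before relying on this key: it can only ever take 24 distinct values, and all documents inserted within the same hour share one value.

```python
from datetime import datetime, timedelta


def with_hour(doc):
    """Return a copy of doc with an 'hour' field (0-23) taken from doc['ts']."""
    return {**doc, "hour": doc["ts"].hour}


def sample_docs(start, days=3, step_minutes=10):
    """Generate evenly spaced fake observations for the demonstration."""
    total_minutes = days * 24 * 60
    return [
        with_hour({"ts": start + timedelta(minutes=m), "value": m})
        for m in range(0, total_minutes, step_minutes)
    ]


if __name__ == "__main__":
    docs = sample_docs(datetime(2013, 4, 4, 16, 0))
    hours = {d["hour"] for d in docs}
    print("distinct key values:", len(hours))
    # Every document written between 16:00 and 16:59 gets the same key value.
    print("keys in the first hour:", {d["hour"] for d in docs[:6]})
```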

answered Oct 01 '22 15:10

Devesh
