How to handle database purging in Mongodb

I use MongoDB to store 30 days of data, which arrives as a stream. I am looking for a purging mechanism that throws away the oldest data to make room for new data. I used to handle this in MySQL with partitions: I kept 30 date-based partitions, dropping the oldest partition and creating a new one to hold incoming data.

When I map the same approach onto MongoDB, I am tempted to use date-based shards. The problem is that this makes my data distribution bad: if all the new data lands on the same shard, that shard becomes hot because most users access the newest data, while the shards holding older data sit mostly idle.

I could instead do collection-based purging: keep 30 collections and throw away the oldest collection to accommodate new data. But there are a couple of problems: 1) if I split the data into smaller collections, I cannot benefit much from sharding, since sharding is done per collection, and 2) my queries have to change to query all 30 collections and take a union of the results.

Please suggest a good purging mechanism (if any) to handle this situation.

asked Jan 18 '12 by user472402

2 Answers

There are really only three ways to do purging in MongoDB. It looks like you've already identified several of the trade-offs.

  1. Single collection, delete old entries
  2. Collection per day, drop old collections
  3. Database per day, drop old databases

Option #1: single collection

pros

  • Easy to implement
  • Easy to run Map/Reduces

cons

  • Deletes are as expensive as inserts, causing lots of IO and creating the need to "defragment" or "compact" the DB.
  • At some point you end up handling double the "writes" as you have to both insert a day's worth of data and delete a day's worth of data.
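
As a rough illustration, the daily delete for option #1 could look like the following in the mongo shell. This is a sketch only; the collection name events and the timestamp field ts are hypothetical, not anything from your setup:

    // Sketch only: delete everything older than 30 days, assuming documents
    // carry their insertion time in a "ts" field (hypothetical name).
    var cutoff = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
    db.events.remove({ ts: { $lt: cutoff } });

    // An index on "ts" keeps the delete from scanning the whole collection.
    db.events.ensureIndex({ ts: 1 });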

Option #2: collection per day

pros

  • Removing data via collection.drop() is very fast.
  • Still Map/Reduce friendly as the output from each day can be merged or re-reduced against the summary data.

cons

  • You may still have some fragmenting problems.
  • You will need to rewrite queries. However, in my experience, if you have enough data that you're purging, you rarely access that data directly; instead you tend to run Map/Reduces over it, so this may not change that many queries.
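
Here is a minimal sketch of the daily rotation for option #2, assuming one collection per day named events_YYYYMMDD (a hypothetical naming scheme):

    // Build a collection name like "events_20120118" from a Date
    // (hypothetical naming scheme).
    function dayName(d) {
        var y = d.getFullYear();
        var m = ("0" + (d.getMonth() + 1)).slice(-2);
        var day = ("0" + d.getDate()).slice(-2);
        return "events_" + y + m + day;
    }

    // Drop the collection that just fell out of the 30-day window.
    var expired = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
    db.getCollection(dayName(expired)).drop();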

Option #3: database per day

pros

  • Deletion is as fast as possible, files are simply truncated.
  • Zero fragmentation problems and easy to backup / restore / archive old data.

cons

  • Will make querying more challenging (expect to write some wrapper code).
  • Not as easy to run Map/Reduces, though take a look at the Aggregation Framework as that may better satisfy your needs anyway.
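
For option #3 the drop itself is equally small. This sketch assumes one database per day, again using the hypothetical events_YYYYMMDD scheme (dayName() as in the previous sketch):

    // Drop the whole database holding the expired day's data.
    var expired = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
    db.getSiblingDB(dayName(expired)).dropDatabase();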

Now there is an option #4, but it is not a general solution. I know of some people who did "purging" by simply using Capped Collections. There are definitely cases where this works, but it has a bunch of caveats, so you really need to know what you're doing.
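
For completeness, creating a capped collection looks like this. Note the main caveat: you cap by size in bytes (and optionally a document count), not by age, so a 30-day window is only approximate. The 50 GB figure here is an arbitrary assumption:

    // Sketch only: MongoDB evicts the oldest documents automatically once the
    // fixed size is reached. Size in bytes is an assumption; tune it to hold
    // roughly 30 days of your data.
    db.createCollection("events", { capped: true, size: 50 * 1024 * 1024 * 1024 });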

answered by Gates VP

You can set a TTL on a collection as of the MongoDB 2.2 release or later. This will expire old data from the collection automatically.

Follow this link: http://docs.mongodb.org/manual/tutorial/expire-data/
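
A minimal sketch, assuming a collection named events whose documents carry a createdAt date field (both names hypothetical); 2592000 seconds is 30 days:

    // TTL index: a background task removes documents once "createdAt" is more
    // than expireAfterSeconds in the past.
    db.events.ensureIndex({ createdAt: 1 }, { expireAfterSeconds: 2592000 });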

answered by geek