I am using MongoDB as a temporary log store. The collection receives ~400,000 new rows an hour. Each row contains a UNIX timestamp and a JSON string.
Periodically I would like to copy the contents of the collection to a file on S3, creating a file for each hour containing ~400,000 rows (eg. today_10_11.log contains all the rows received between 10am and 11am). I need to do this copy while the collection is receiving inserts.
My question: what is the performance impact of having an index on the timestamp column on the 400,000 hourly inserts verses the additional time it will take to query an hours worth of rows.
The application in question is using written in Ruby running on Heroku and using the MongoHQ plugin.
Mongo indexes the _id field by default, and the ObjectId already starts with a timestamp, so basically, Mongo is already indexing your collection by insertion time for you. So if you're using the Mongo defaults, you don't need to index a second timestamp field (or even add one).
To get the creation time of an object id in ruby:
ruby-1.9.2-p136 :001 > id = BSON::ObjectId.new
=> BSON::ObjectId('4d5205ed0de0696c7b000001')
ruby-1.9.2-p136 :002 > id.generation_time
=> 2011-02-09 03:11:41 UTC
To generate the object ids for a given time:
ruby-1.9.2-p136 :003 > past_id = BSON::ObjectId.from_time(1.week.ago)
=> BSON::ObjectId('4d48cb970000000000000000')
So, for example, if you wanted to load all docs inserted in the past week, you'd simply search for _ids greater than past_id and less than id. So, through the Ruby driver:
collection.find({:_id => {:$gt => past_id, :$lt => id}}).to_a
=> #... a big array of hashes.
You can, of course, also add a separate field for timestamps, and index it, but there's no point in taking that performance hit when Mongo's already doing the necessary work for you with its default _id field.
More information on object ids.
I have an application like yours, and currently it has 150 million log records. At 400k an hour, this DB will get large fast. 400k inserts an hour with indexing on timestamp will be much more worthwhile than doing an unindexed query. I have no problem with inserting tens of millions of records in an hour with indexed timestamp, yet if I do an unindexed query on timestamp it takes a couple of minutes on a 4 server shard (cpu bound). Indexed query comes up instantly. So definitely index it, the write overhead on indexing is not that high and 400k records an hour is not much for mongo.
One thing you do have to look out for is memory size though. At 400k records an hour you are doing 10 million a day. That would consume about 350MB of memory a day to keep that index in memory. So if this goes for a while your index can get larger than memory fast.
Also, if you are truncating records after some time period using remove, I have found that removes create a large amount of IO to disk and it is disk bound.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With