I am currently working on a MongoDB based system that will store at least a billion documents. This will increase by around 50 million each month.
The id of the main collection is of the form YYYYMM_SOURCEID_DOCTYPE_UUID and serves as the shard index. Each record results in about 1kb of index. 99% of operations will occur on the most recent three months of data. We would like to support keyword search of documents, with very good performance in the most recent three months of data and at least semi-decent performance on older stuff.
Does MongoDB sound like a reasonable solution as long as I can keep the active end of the index in memory?
I would suggest you change your shard key as with the current one it seems like you might go hit the last shard for everything as the YYYYMM bit of the key will make all new inserts go to the "most right" shard always. http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key#ChoosingaShardKey-Cardinality has some more info on that.
Depending on the cardinality of the "keywords" field, you might want to pick that as your shard key. This way, mongodb could easily fetch all documents belonging to a keyword from one shard. All writes will still go to all shards because it's partitioned by keyword.
If the cardinality of "keywords" is not very high (ie, < 100) then this is not a good shard key, however, you could combine it with year and month, such as keywords_YYYYMM.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With