I'm working on a project where we periodically collect large quantities of e-mail via IMAP or POP, perform analysis on it (such as clustering into conversations, extracting important sentences etc.), and then present views via the web to the end user.
The main view will be a Facebook-like profile page for each contact, showing the most recent (20 or so) conversations that contact has had, drawn from the e-mail we capture.
For us, it's important to be able to retrieve the profile page and recent 20 items frequently and quickly. We may also be frequently inserting recent e-mails into this feed. For this, document storage and MongoDB's low-cost atomic writes seem pretty attractive.
However, we'll also have a LARGE volume of old e-mail conversations that won't be frequently accessed (since they won't appear in the most recent 20 items, folks will only see them if they search for them, which will be relatively rare). Furthermore, this data will grow more quickly than the contact store over time.
From what I've read, MongoDB seems to more or less require the entire data set to remain in RAM, and the only way to work around this is to use virtual memory, which can carry a significant overhead. If Mongo isn't able to differentiate between the volatile data (profiles/feeds) and the non-volatile data (old e-mails), this could end up being quite nasty (and since it seems to delegate virtual memory management to the OS, I don't see how it would be possible for Mongo to do this).
It would seem that the only choices are to either (a) buy enough RAM to store everything, which is fine for the volatile data but hardly cost-efficient for terabytes of e-mail, or (b) use virtual memory and see reads/writes on our volatile data slow to a crawl.
Is this correct, or am I missing something? Would MongoDB be a good fit for this particular problem? If so, what would the configuration look like?
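MongoDB does not "require the entire data set to remain in RAM". See http://www.mongodb.org/display/DOCS/Caching for an explanation as to why/how it uses virtual memory the way it does. In short, it memory-maps its data files and leaves caching to the operating system, which keeps recently touched pages in RAM. What needs to fit in memory is the working set (here, the profiles, the recent feeds, and their indexes), not the whole archive of old e-mail.
MongoDB would be fine for this application. If your sorting and filtering were more complex you might, for example, want to use a Map-Reduce operation to create a collection that's "display ready", but for a simple date-ordered set the existing indexes will work just fine.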
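As a rough illustration of the "date-ordered set served by an index" idea, here is a minimal sketch in Python. The collection name `conversations` and the fields `contact_id` and `last_message_at` are assumptions made up for the example, not anything from your schema. The calls used (`create_index`, `find`, `sort`, `limit`) are the ones pymongo exposes on a collection and cursor, so any pymongo `Collection` should work as the first argument.

```python
# Hypothetical field names. 1 = ascending, -1 = descending
# (the same values as pymongo.ASCENDING / pymongo.DESCENDING).
FEED_INDEX = [("contact_id", 1), ("last_message_at", -1)]
FEED_SIZE = 20  # "the most recent 20 or so" conversations


def ensure_feed_index(conversations):
    """Create the compound index that the profile-page query relies on."""
    return conversations.create_index(FEED_INDEX)


def recent_conversations(conversations, contact_id, limit=FEED_SIZE):
    """Return the newest `limit` conversations for one contact, newest first."""
    cursor = (
        conversations.find({"contact_id": contact_id})
        .sort([("last_message_at", -1)])
        .limit(limit)
    )
    return list(cursor)
```

With an index shaped like this, the query should only need to touch the index entries and documents for one contact's latest conversations, so the old e-mail can stay on disk until someone searches for it. Whether that holds for your data is something to confirm with `explain()` on a realistic data set.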