I've got a C++ program that's likely to generate a HUGE amount of data -- billions of binary records of varying sizes, most probably less than 256 bytes but a few stretching to several K. Most of the records will seldom be looked at by the program after they're created, but some will be accessed and modified regularly. There's no way to tell which are which when they're created.
Considering the volume of data, there's no way I can store it all in memory. But as the data only needs to be indexed and accessed by its number (a 64-bit integer), I don't want the overhead of a full-fledged database program. Ideally I'd like to treat it as an std::map with its data stored on disk until requested.
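Roughly, the interface I'm picturing would be something like this (the class and method names are purely illustrative, not an existing library):

    #include <cstdint>
    #include <vector>

    // Hypothetical sketch of the interface I'm after -- not a real library.
    class DiskBackedMap
    {
    public:
        // Write a record to disk under its 64-bit id.
        void put(std::uint64_t id, const std::vector<char>& record);

        // Read a record back into memory only when it's actually requested.
        std::vector<char> get(std::uint64_t id) const;

        // Replace an existing record; the new size may differ from the old one.
        void update(std::uint64_t id, const std::vector<char>& record);
    };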
Is there an already-written library that will do what I'm looking for, or do I need to write it myself?
EDIT: After some thought, I realized that Rob Walker's answer had a valid point: I'd be hard-pressed to get anywhere near the same kind of data integrity out of a home-brew class that I'd get from a real database.
Although BerkeleyDB (as suggested by RHM) looks like it would do exactly what we're looking for, the dual-licensing is a headache that we don't want to deal with. When we're done with the code and can prove that it would benefit noticeably from BerkeleyDB (which it probably would), we'll reexamine the issue.
I did look at Ferruccio's suggestion of stxxl, but I wasn't able to tell how it would handle the program being interrupted and restarted (maybe with changes). With that much data, I'd hate to just scrap what it had already completed and start over every time, if some of the data could be saved.
So we've decided to use an SQLite database, at least for the initial development. Thanks to everyone who answered or voted.
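For anyone curious, storing a record boils down to something like the following with the plain SQLite C API. This is only a rough sketch; the table name and schema are assumptions (e.g. CREATE TABLE records (id INTEGER PRIMARY KEY, data BLOB)).

    #include <sqlite3.h>
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    // Hypothetical helper: store one binary record under a 64-bit id.
    void put_record(sqlite3* db, std::uint64_t id, const std::vector<char>& blob)
    {
        sqlite3_stmt* stmt = nullptr;
        if (sqlite3_prepare_v2(db,
                "INSERT OR REPLACE INTO records (id, data) VALUES (?1, ?2)",
                -1, &stmt, nullptr) != SQLITE_OK)
            throw std::runtime_error(sqlite3_errmsg(db));

        sqlite3_bind_int64(stmt, 1, static_cast<sqlite3_int64>(id));
        sqlite3_bind_blob(stmt, 2, blob.data(), static_cast<int>(blob.size()),
                          SQLITE_TRANSIENT);

        if (sqlite3_step(stmt) != SQLITE_DONE) {
            sqlite3_finalize(stmt);
            throw std::runtime_error(sqlite3_errmsg(db));
        }
        sqlite3_finalize(stmt);
    }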
Take a look at STXXL.
stxxl::map<> looks like it does exactly what you need.
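A rough sketch of what that might look like, based on the STXXL tutorial example. Note the caveats: stxxl::map stores fixed-size PODs, so for variable-length records you'd probably map the id to an offset/length in a separate payload file (the RecordRef type below is my own assumption), and the comparator must supply a max_value(), which reserves that key value as a sentinel. It also assumes STXXL's disk configuration (.stxxl) is set up.

    #include <stxxl/map>
    #include <cstdint>
    #include <limits>
    #include <utility>

    struct CompareLess
    {
        bool operator()(const std::uint64_t& a, const std::uint64_t& b) const
        { return a < b; }
        // Required by stxxl::map; this key value becomes unusable.
        static std::uint64_t max_value()
        { return std::numeric_limits<std::uint64_t>::max(); }
    };

    struct RecordRef              // fixed-size value type (my assumption)
    {
        std::uint64_t offset;     // where the record lives in a payload file
        std::uint32_t length;     // record size in bytes
    };

    typedef stxxl::map<std::uint64_t, RecordRef, CompareLess, 4096, 4096> map_type;

    int main()
    {
        // Cache sizes are in bytes; a few blocks each is enough for a demo.
        map_type records(map_type::node_block_type::raw_size * 3,
                         map_type::leaf_block_type::raw_size * 3);

        RecordRef ref = { 0, 128 };
        records.insert(std::make_pair(std::uint64_t(42), ref));
        map_type::iterator it = records.find(42);
        return (it != records.end()) ? 0 : 1;
    }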
I doubt you will find a library that meets your requirements exactly, so you'll have to decide on what 'features' are really important to you and then decide if an existing DB solution comes close enough.
Billions of records is a large dataset by any stretch. What rate are records generated at? How long do they persist? Does the access pattern change over time?
Are updates always with the same amount of data as the original?
I would suggest proving definitively that a DB solution isn't going to work before starting to roll your own, particularly if integrity of the data is paramount (and it usually is...). Maintaining that volume of data on disk reliably can definitely be a challenge. Do you need any kind of transaction semantics when changing the data? Is the client multithreaded?
BerkeleyDB might be good for you. Its keys are arbitrary byte strings rather than numbers, but you can pass the raw bytes of your 64-bit integer (or format it as hex) as the key. It's supposed to be pretty much as fast as it gets for disk-based key/value lookup.
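A minimal sketch of a put/get round-trip with the Berkeley DB C++ API, passing the raw 8 bytes of the id as the key. Error handling is mostly omitted (by default the C++ API throws DbException), and note that raw little-endian keys won't sort numerically in the B-tree, which only matters if you need ordered traversal.

    #include <db_cxx.h>
    #include <cstdint>
    #include <cstdlib>

    int main()
    {
        Db db(nullptr, 0);
        db.open(nullptr,          // no transaction
                "records.db",     // file on disk
                nullptr,          // default database within the file
                DB_BTREE,         // keys kept in a B-tree
                DB_CREATE, 0664);

        std::uint64_t id = 42;
        const char payload[] = "some binary record";

        Dbt key(&id, sizeof(id));
        Dbt value(const_cast<char*>(payload), sizeof(payload));
        db.put(nullptr, &key, &value, 0);

        Dbt result;
        result.set_flags(DB_DBT_MALLOC);   // let BDB allocate the buffer
        if (db.get(nullptr, &key, &result, 0) == 0) {
            // result.get_data() / result.get_size() hold the stored bytes
            std::free(result.get_data());
        }

        db.close(0);
        return 0;
    }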