We use a dict containing about 4 GB of data for data processing. It's convenient and fast. The problem is that this dict might grow beyond 32 GB.

I'm looking for a way to use a dict (a plain variable with a get() method etc.) that can be bigger than the available memory. Ideally it would store the data on disk and fetch a value from disk whenever get(key) is called and the value for that key is not in memory.

I'd prefer not to use an external service, like a SQL database. I did find shelve, but it seems to need the memory too.

Any ideas on how to approach this problem?
The Python dictionary implementation consumes a surprisingly small amount of memory. But the space taken by the many int and (in particular) string objects, for reference counts, pre-calculated hash codes etc., is more than you'd think at first.
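One way to see this split between the dict structure itself and the objects it references is a rough sketch with sys.getsizeof (the entry count and key/value types here are just illustrative):

```python
import sys

# Build a dict of 100k int -> str entries and compare the size of the
# dict's hash table with the size of the objects it references.
n = 100_000
d = {i: str(i) for i in range(n)}

dict_struct = sys.getsizeof(d)                      # the hash table only
keys = sum(sys.getsizeof(k) for k in d)             # the int objects
values = sum(sys.getsizeof(v) for v in d.values())  # the str objects

print(f"dict structure: {dict_struct / 2**20:.1f} MiB")
print(f"keys + values:  {(keys + values) / 2**20:.1f} MiB")
```

On CPython the referenced int and str objects typically take more space than the hash table itself, which is where the "more than you'd think" overhead comes from.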
A test that keeps growing a dict will not print its output, because the machine runs out of memory before the dict reaches 2^27 entries. So the dictionary itself has no hard size limit; the practical limit is the available memory.
Analysis of the test run result: a dictionary is about 6.6 times faster than a list when looking a value up among 100 items.
If you just want to work with a larger dictionary than memory can hold, the shelve module is a good quick-and-dirty solution. It acts like an in-memory dict, but stores itself on disk rather than in memory. shelve pickles its values (via cPickle in Python 2), so be sure to set your protocol to anything other than 0, the inefficient old default.
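A minimal sketch of that approach (the filename and keys here are just illustrative):

```python
import pickle
import shelve

# Open (or create) a shelf file on disk; a protocol above 0 stores
# values far more compactly than the historical default.
with shelve.open("bigdata", protocol=pickle.HIGHEST_PROTOCOL) as db:
    db["user:1"] = {"name": "alice", "visits": 42}  # pickled, written to disk
    record = db["user:1"]        # read back through the dict-like API
    missing = db.get("missing")  # None, just like dict.get()

print(record, missing)
```

Note that shelve only writes mutated values back automatically if you open it with writeback=True, which caches entries in memory; without it, reassign the value explicitly after changing it.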
It sounds like you could use a key-value store, which is currently hyped under the No-SQL buzzword. A good introduction can be found, for instance, at
http://ayende.com/blog/4449/that-no-sql-thing-key-value-stores.
It is simply a database with exactly the API you described.
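If you want a key-value store without any external service, the standard library's dbm module already provides a disk-backed one; a minimal sketch (the filename and keys are illustrative, and note that dbm works on bytes):

```python
import dbm

# "c" opens the database file for read/write, creating it if needed.
# dbm keys and values must be bytes, so encode/decode at the boundary.
with dbm.open("kvstore", "c") as db:
    db[b"greeting"] = b"hello"       # written straight to disk
    value = db[b"greeting"].decode() # read back from disk

print(value)
```

The trade-off versus shelve is that dbm stores raw bytes only, so you must serialize structured values yourself.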
I couldn't find any (fast) module to do this and decided to create my own (my first Python project; thanks @blubber for some ideas :P). You can find it on GitHub: https://github.com/ddofborg/diskdict Comments are welcome!
If you don't want to use a SQL database (which is a reasonable solution to a problem like this), you'll have to either figure out a way to compress the data you're working with, or use a library like this one (or your own) to do the mapping to disk yourself.
You can also look at this question for some more strategies.
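If you do roll your own mapping to disk, the embedded sqlite3 module (no external server) is one common backing store. This is a hypothetical sketch of a dict-like wrapper, not the library linked above:

```python
import pickle
import sqlite3

class SQLiteDict:
    """Minimal dict-like object that keeps its data in a SQLite file."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)")

    def __setitem__(self, key, value):
        # Pickle the value so arbitrary Python objects can be stored.
        self.conn.execute("REPLACE INTO kv (key, value) VALUES (?, ?)",
                          (key, pickle.dumps(value)))
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return pickle.loads(row[0])

    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default
```

Usage would look like `d = SQLiteDict("data.db"); d["x"] = [1, 2, 3]; d.get("x")`. Only the rows you actually read are loaded, so the store can grow well past available memory; committing on every write is simple but slow, so batching writes would be the first optimization.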