
How to use large dicts in Python which do not fit in memory?

We use a dict which contains about 4GB of data for data processing. It's convenient and fast.

The problem we are having is that this dict might grow beyond 32GB.

I'm looking for a way to use a dict (just like a normal variable, with a get() method etc.) which can be bigger than the available memory. Ideally, this dict would store its data on disk and read a value back from disk when get(key) is called and the value for that key is not in memory.

I would prefer not to use an external service, such as a SQL database.

I did find shelve, but it seems to need the memory too.

Any ideas on how to approach this problem?

ddofborg asked Dec 19 '13 15:12



People also ask

Do Python dictionaries take up a lot of memory?

The Python dictionary implementation consumes a surprisingly small amount of memory. But the space taken by the many int and (in particular) string objects, for reference counts, pre-calculated hash codes etc., is more than you'd think at first.

How big can a dictionary be in Python?

It will not display the output because the computer ran out of memory before reaching 2^27. So there is no size limitation in the dictionary.

Are Dicts faster than lists Python?

Analysis of the test run result: a dictionary is 6.6 times faster than a list when we look up in 100 items.

How do I save a large dictionary in Python?

If you just want to work with a larger dictionary than memory can hold, the shelve module is a good quick-and-dirty solution. It acts like an in-memory dict, but stores itself on disk rather than in memory. shelve is based on pickle (cPickle in Python 2), so on Python 2 be sure to set the protocol to something other than the default 0.
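As a rough illustration of the shelve approach described above, here is a minimal sketch for Python 3. The file name "bigdict", the key names, and the helper names `shelve_roundtrip` and `demo` are made up for this example. It assumes string keys (shelve requires them) and leaves `writeback` at its default of False, so values that have been read are not cached in memory by the shelf.

```python
import os
import shelve
import tempfile


def shelve_roundtrip(path):
    """Write entries to a shelf on disk, reopen it, and look two keys up."""
    # writeback=False is the default: the shelf keeps no in-memory cache of
    # the entries that were read or written.
    with shelve.open(path, flag="c", writeback=False) as db:
        for i in range(1000):
            # Keys must be str; values can be any picklable object.
            db[f"key{i}"] = {"id": i, "payload": "x" * 100}

    # Reopen read-only; get() works like dict.get(), including the default.
    with shelve.open(path, flag="r") as db:
        return db.get("key42"), db.get("missing")


def demo():
    """Run the round trip inside a throwaway directory (illustration only)."""
    with tempfile.TemporaryDirectory() as tmp:
        return shelve_roundtrip(os.path.join(tmp, "bigdict"))
```

Calling `demo()` returns the stored value for "key42" and None for the missing key. How fast this is depends on which dbm backend Python picks on your system, so it is worth measuring with realistic data.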


3 Answers

It sounds like you could use a key-value store, a kind of database that is currently hyped under the buzzword NoSQL. A good introduction can be found, for instance, at

http://ayende.com/blog/4449/that-no-sql-thing-key-value-stores

It is simply a database with the API you described.
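To make the "database with a dict-like API" idea concrete, here is a small sketch using Python's standard dbm module, which is an embedded key-value store and not an external service. This is only one possible choice, not necessarily what the answer had in mind. The helper names `put`, `get`, and `demo`, the file name "store", and the sample record are invented for illustration. Values are serialized as JSON because dbm stores bytes only.

```python
import dbm
import json
import os
import tempfile


def put(db, key, value):
    """Store a JSON-serializable value under a string key."""
    db[key.encode("utf-8")] = json.dumps(value).encode("utf-8")


def get(db, key, default=None):
    """Fetch a value from disk, or return default if the key is absent."""
    raw_key = key.encode("utf-8")
    if raw_key not in db:
        return default
    return json.loads(db[raw_key].decode("utf-8"))


def demo():
    """Write one record, reopen the store, and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "store")
        with dbm.open(path, "c") as db:  # "c": create the file if needed
            put(db, "user:1", {"name": "alice", "visits": 3})
        with dbm.open(path, "r") as db:  # "r": read-only
            return get(db, "user:1"), get(db, "user:2")
```

Only the entries you actually look up are read from disk, so the size of the store is limited by disk space and not by RAM.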

jcklie answered Sep 20 '22 23:09



I couldn't find any (fast) module to do this, so I decided to create my own (my first Python project, thanks @blubber for some ideas :P). You can find it on GitHub: https://github.com/ddofborg/diskdict Comments are welcome!

ddofborg answered Sep 20 '22 23:09



If you don't want to use an SQL database (which is a reasonable solution to a problem like this), you'll have to either figure out a way to compress the data you're working with, or use a library like this one (or your own) to do the mapping to disk yourself.

You can also look at this question for some more strategies.
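For the "do the mapping to disk yourself" route, one possible sketch is a small dict-like wrapper around a SQLite file. SQLite ships with Python and runs in-process, so there is no external service involved. The class name `DiskBackedDict` and the table name `kv` are hypothetical, and this is not the library linked above. It assumes string keys and picklable values.

```python
import pickle
import sqlite3
from collections.abc import MutableMapping


class DiskBackedDict(MutableMapping):
    """Hypothetical dict-like wrapper that keeps its entries in SQLite."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)"
        )
        self._conn.commit()

    def __getitem__(self, key):
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return pickle.loads(row[0])

    def __setitem__(self, key, value):
        blob = pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)
        self._conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, blob)
        )

    def __delitem__(self, key):
        cur = self._conn.execute("DELETE FROM kv WHERE key = ?", (key,))
        if cur.rowcount == 0:
            raise KeyError(key)

    def __iter__(self):
        for (key,) in self._conn.execute("SELECT key FROM kv"):
            yield key

    def __len__(self):
        return self._conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]

    def close(self):
        # Writes are committed here; call close() before exiting.
        self._conn.commit()
        self._conn.close()
```

Because the class derives from MutableMapping, methods such as get(), items(), and update() come for free, so `store.get("some key")` reads the value from disk on demand. An in-memory cache of hot keys could be layered on top if lookups turn out to be too slow.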

maxywb answered Sep 21 '22 23:09
