Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to store a big dictionary?

I have a big dictionary(28 MB) 'MyDict' stored in a MyDict.py file.

If I execute the statement:

from MyDict import MyDict

A MemoryError exception is thrown.

How can I access this dictionary using cPickle or shelve modules.

How can I write this MyDict.py file to cPickle or shelve without accessing MyDict.

This MyDict is generated by writing into a file. Here is a key-value pair from the dictionary:

{"""ABCD""" : [[(u'2011-03-21', 35.5, 37.5, 35.3, 35.85, 10434.0, 35.85), (u'2012-03-03', 86.0, 87.95, 85.55, 86.2, 30587.0, 86.2), (u'2011-03-23', 36.9, 36.9, 35.25, 36.1, 456.0, 36.1)],
    [(u'2011-03-18', 37.0, 38.0, 36.5, 36.5, 861.0, 36.5), (u'2012-03-03', 86.0, 87.95, 85.55, 86.2, 30587.0, 86.2), (u'2011-03-21', 35.5, 37.5, 35.3, 35.85, 10434.0, 35.85)],
    [(u'2011-03-16', 37.0, 37.9, 36.3, 36.7, 3876.0, 36.7), (u'2012-03-03', 86.0, 87.95, 85.55, 86.2, 30587.0, 86.2), (u'2011-03-21', 35.5, 37.5, 35.3, 35.85, 10434.0, 35.85)],
    [(u'2010-12-09', 40.5, 41.95, 36.3, 36.75, 42943.0, 36.75), (u'2011-10-26', 67.95, 71.9, 66.45, 70.35, 180812.0, 70.35), (u'2011-03-21', 35.5, 37.5, 35.3, 35.85, 10434.0, 35.85)],
    [(u'2009-01-16', 14.75, 15.0, 14.0, 14.15, 14999.0, 14.05), (u'2010-01-11', 50.0, 52.8, 49.0, 50.95, 174826.0, 50.95), (u'2009-01-27', 14.3, 15.0, 13.9, 14.15, 3862.0, 14.15)]]}
like image 263
Vinod Avatar asked Dec 21 '22 12:12

Vinod


1 Answers

shelve is actually a pretty good choice here. It acts just like a dictionary, but it's backed by a BDB (or similar) key-value database file, and Python will handle all the caching, etc., so it doesn't need to load the whole thing into memory at once.

Here's how to create the shelve file. Note that shelf keys have to be strings. Also note that I'm creating the shelf in-place, rather than first creating a dict and shelving it. That way you avoid the cost of having to build that giant in-memory dict that was causing problems in the first place.

from contextlib import closing
import shelve

def makedict(shelf):
    # Put the real dict-generating code here, obviously
    for i in range(500000);
        shelf[str(i)] = i

with closing(shelve.open('mydict.shelf', 'c')) as shelf:
    makedict(shelf)

And to use it, don't actually read it in; leave it as an on-disk shelf:

from contextlib import closing
import shelve

with closing(shelve.open('mydict.shelf')) as d:
    # Put all your actual work here.
    print len(d)

If your dictionary-using code doesn't fit easily into a scope, replace the with statement with a plain open, and explicitly close it when you're done.

pickle is probably not as good of an idea, because you still have to read the whole thing into memory. It will probably use a lot less transient memory, and maybe disk space, than importing a module that defines a giant literal, but still, having an in-memory hash table that huge could still be a problem. But you can always test it and see how well it works.

Here's how to create the pickle file. Note that you can use (nearly) anything you want as a key, not just strings. However, you have to build the whole dict before you can pickle it.

import cPickle

def makedict():
    # Put the real dict-generating code here, obviously
    return {i:i for i in range(500000)}

with open('mydict.pickle', 'wb') as f:
    cPickle.dump(d, f, -1)

This creates a 47MB file.

Now, to use it in your main app:

import cPickle

def loaddict():
    with open('mydict.pickle', 'rb') as f:
        return cPickle.load(f)

The same basic problems with pickle go for any other persistence format that has to be saved and loaded—whether something custom that you write yourself, or something standard like JSON or YAML. (Of course if you need interoperability with other programs, especially in other languages, something like JSON is the way to go.) You're better off with a database; the only question is, what kind of database.

The advantage of an anydbm type database is that you can use it as if it were a dict, without worrying about how to load/save/access it (other than the open and close lines). The problem with anydbm is that it only lets you map strings to strings.

The shelve module effectively wraps anydbm, with pickling of each value. Your keys still have to be strings, but your values can be almost anything. So as long as your keys are strings, and you don't have any references from the values to external objects, it's a pretty transparent drop-in replacement for a dict.

The other options—sqlite3, various modern nosql databases, etc.—require you to change the way you access data, and even the way you organize it. (A "list of lists" isn't a clear ER model.) Of course in the long run, this might result in a better design, so if you think you really should be using a relational model, consider this idea.


From the comments, @ekta wanted me to explain why some of the restrictions on dbm and shelve exist.

First, dbm goes back to the 70s. A database that could map 8-bit strings to strings simply and efficiently was a pretty huge deal back then. It was also pretty common for values of all kinds to be stored as their string representation—or, if not that, then just storing the bytes that happen to represent the value natively on the current machine. (XML, JSON, or even endianness-swapping may have been too expensive for the machines of the day, or at least the thinking of the day.)

Extending dbm to handle other data types for the values isn't hard. They never need to be hashed or compared, just stored and retrieved losslessly. Since pickle can handle a very wide variety of types, isn't too horribly inefficient, and comes with Python, it makes sense to use pickle for that, so shelve does exactly that.

But the keys are a different story. You need an encoding that's not only losslessly reversible, but also ensures that two values will encode to equal bytes if and only if they're actually equal. Keep in mind that in Python, 1 == True, but obviously pickle.dumps(1) != pickle.dumps(True), b'1' != b'True', etc.

There are plenty of types that can be losslessly and equality-preservingly converted to bytes if you only care about that type. For example, for Unicode strings, just use UTF-8. (Actually, shelve takes care of that one for you.) For 32-bit signed integers, use struct.pack('>I'). For tuples of three strings, encode to UTF-8, backslash-escape, and join them with newlines. And so on. For many specific domains, there's an easy answer; there's just no general-purpose answer that works for most domains.

So, if you want to use a dbm to use, say, tuples of three UTF-8 strings as keys, you can write your own wrapper around dbm (or shelve). As with many modules in the stdlib, shelve is intended to be helpful sample code as well as a usable feature, which is why the docs have a link to the source. It's simple enough that a novice should be able to figure out how to either fork it, subclass it, or wrap it to do his own custom key encoding. (Note that if you wrap shelve, you will have to encode your custom values to str just so it can encode that str to bytes; if you fork it, or subclass it and override the relevant methods, you can instead encode directly to bytes—e.g., that struct.pack call above. This may be better for both simplicity/readability and performance.)

like image 133
abarnert Avatar answered Jan 07 '23 04:01

abarnert