I have a big dictionary(28 MB) 'MyDict' stored in a MyDict.py
file.
If I execute the statement:
from MyDict import MyDict
A MemoryError
exception is thrown.
How can I access this dictionary using cPickle
or shelve
modules.
How can I write this MyDict.py
file to cPickle
or shelve
without accessing MyDict.
This MyDict is generated by writing into a file. Here is a key-value pair from the dictionary:
{"""ABCD""" : [[(u'2011-03-21', 35.5, 37.5, 35.3, 35.85, 10434.0, 35.85), (u'2012-03-03', 86.0, 87.95, 85.55, 86.2, 30587.0, 86.2), (u'2011-03-23', 36.9, 36.9, 35.25, 36.1, 456.0, 36.1)],
[(u'2011-03-18', 37.0, 38.0, 36.5, 36.5, 861.0, 36.5), (u'2012-03-03', 86.0, 87.95, 85.55, 86.2, 30587.0, 86.2), (u'2011-03-21', 35.5, 37.5, 35.3, 35.85, 10434.0, 35.85)],
[(u'2011-03-16', 37.0, 37.9, 36.3, 36.7, 3876.0, 36.7), (u'2012-03-03', 86.0, 87.95, 85.55, 86.2, 30587.0, 86.2), (u'2011-03-21', 35.5, 37.5, 35.3, 35.85, 10434.0, 35.85)],
[(u'2010-12-09', 40.5, 41.95, 36.3, 36.75, 42943.0, 36.75), (u'2011-10-26', 67.95, 71.9, 66.45, 70.35, 180812.0, 70.35), (u'2011-03-21', 35.5, 37.5, 35.3, 35.85, 10434.0, 35.85)],
[(u'2009-01-16', 14.75, 15.0, 14.0, 14.15, 14999.0, 14.05), (u'2010-01-11', 50.0, 52.8, 49.0, 50.95, 174826.0, 50.95), (u'2009-01-27', 14.3, 15.0, 13.9, 14.15, 3862.0, 14.15)]]}
shelve
is actually a pretty good choice here. It acts just like a dictionary, but it's backed by a BDB (or similar) key-value database file, and Python will handle all the caching, etc., so it doesn't need to load the whole thing into memory at once.
Here's how to create the shelve file. Note that shelf keys have to be strings. Also note that I'm creating the shelf in-place, rather than first creating a dict
and shelving it. That way you avoid the cost of having to build that giant in-memory dict
that was causing problems in the first place.
from contextlib import closing
import shelve
def makedict(shelf):
# Put the real dict-generating code here, obviously
for i in range(500000);
shelf[str(i)] = i
with closing(shelve.open('mydict.shelf', 'c')) as shelf:
makedict(shelf)
And to use it, don't actually read it in; leave it as an on-disk shelf:
from contextlib import closing
import shelve
with closing(shelve.open('mydict.shelf')) as d:
# Put all your actual work here.
print len(d)
If your dictionary-using code doesn't fit easily into a scope, replace the with
statement with a plain open
, and explicitly close
it when you're done.
pickle
is probably not as good of an idea, because you still have to read the whole thing into memory. It will probably use a lot less transient memory, and maybe disk space, than importing a module that defines a giant literal, but still, having an in-memory hash table that huge could still be a problem. But you can always test it and see how well it works.
Here's how to create the pickle file. Note that you can use (nearly) anything you want as a key, not just strings. However, you have to build the whole dict
before you can pickle
it.
import cPickle
def makedict():
# Put the real dict-generating code here, obviously
return {i:i for i in range(500000)}
with open('mydict.pickle', 'wb') as f:
cPickle.dump(d, f, -1)
This creates a 47MB file.
Now, to use it in your main app:
import cPickle
def loaddict():
with open('mydict.pickle', 'rb') as f:
return cPickle.load(f)
The same basic problems with pickle
go for any other persistence format that has to be saved and loaded—whether something custom that you write yourself, or something standard like JSON or YAML. (Of course if you need interoperability with other programs, especially in other languages, something like JSON is the way to go.) You're better off with a database; the only question is, what kind of database.
The advantage of an anydbm
type database is that you can use it as if it were a dict
, without worrying about how to load/save/access it (other than the open
and close
lines). The problem with anydbm
is that it only lets you map strings to strings.
The shelve
module effectively wraps anydbm
, with pickling of each value. Your keys still have to be strings, but your values can be almost anything. So as long as your keys are strings, and you don't have any references from the values to external objects, it's a pretty transparent drop-in replacement for a dict
.
The other options—sqlite3
, various modern nosql databases, etc.—require you to change the way you access data, and even the way you organize it. (A "list of lists" isn't a clear ER model.) Of course in the long run, this might result in a better design, so if you think you really should be using a relational model, consider this idea.
From the comments, @ekta wanted me to explain why some of the restrictions on dbm
and shelve
exist.
First, dbm
goes back to the 70s. A database that could map 8-bit strings to strings simply and efficiently was a pretty huge deal back then. It was also pretty common for values of all kinds to be stored as their string representation—or, if not that, then just storing the bytes that happen to represent the value natively on the current machine. (XML, JSON, or even endianness-swapping may have been too expensive for the machines of the day, or at least the thinking of the day.)
Extending dbm
to handle other data types for the values isn't hard. They never need to be hashed or compared, just stored and retrieved losslessly. Since pickle
can handle a very wide variety of types, isn't too horribly inefficient, and comes with Python, it makes sense to use pickle
for that, so shelve
does exactly that.
But the keys are a different story. You need an encoding that's not only losslessly reversible, but also ensures that two values will encode to equal bytes if and only if they're actually equal. Keep in mind that in Python, 1 == True
, but obviously pickle.dumps(1) != pickle.dumps(True)
, b'1' != b'True'
, etc.
There are plenty of types that can be losslessly and equality-preservingly converted to bytes if you only care about that type. For example, for Unicode strings, just use UTF-8. (Actually, shelve
takes care of that one for you.) For 32-bit signed integers, use struct.pack('>I')
. For tuples of three strings, encode to UTF-8, backslash-escape, and join them with newlines. And so on. For many specific domains, there's an easy answer; there's just no general-purpose answer that works for most domains.
So, if you want to use a dbm
to use, say, tuples of three UTF-8 strings as keys, you can write your own wrapper around dbm
(or shelve
). As with many modules in the stdlib, shelve
is intended to be helpful sample code as well as a usable feature, which is why the docs have a link to the source. It's simple enough that a novice should be able to figure out how to either fork it, subclass it, or wrap it to do his own custom key encoding. (Note that if you wrap shelve
, you will have to encode your custom values to str
just so it can encode that str
to bytes
; if you fork it, or subclass it and override the relevant methods, you can instead encode directly to bytes
—e.g., that struct.pack
call above. This may be better for both simplicity/readability and performance.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With