
Using cPickle to serialize a large dictionary causes MemoryError

I'm writing an inverted index for a search engine on a collection of documents. Right now, I'm storing the index as a dictionary of dictionaries. That is, each keyword maps to a dictionary of docIDs->positions of occurrence.

The data model looks something like: {word : { doc_name : [location_list] } }

Building the index in memory works fine, but when I try to serialize to disk, I hit a MemoryError. Here's my code:

import sys
import cPickle

# Write the index out to disk; the with-block closes the file even on error
with open(sys.argv[3], 'wb') as serializedIndex:
    cPickle.dump(index, serializedIndex, cPickle.HIGHEST_PROTOCOL)

Right before serialization, my program is using about 50% of memory (1.6 GB). As soon as I call cPickle.dump, memory usage skyrockets to 80% before the program crashes.

Why is cPickle using so much memory for serialization? Is there a better way to approach this problem?

Stephen Poletto asked Feb 18 '11 03:02



1 Answer

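cPickle needs a lot of extra memory because it does cycle detection: it keeps a memo of every object it has already serialized so that shared and recursive references can be handled, and that memo grows with the number of objects in the index. You could try using the marshal module if you are sure your data has no cycles.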
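
A minimal sketch of the marshal approach, assuming the index contains only built-in types (dict, list, str, int). The `save_index` and `load_index` helpers, the sample data, and the file name are hypothetical, chosen just for illustration:

```python
import marshal
import os
import tempfile


def save_index(index, path):
    """Write the index with marshal; only built-in types are supported."""
    with open(path, "wb") as f:
        marshal.dump(index, f)


def load_index(path):
    """Read back an index written by save_index."""
    with open(path, "rb") as f:
        return marshal.load(f)


# Tiny stand-in for the real index: {word: {doc_name: [positions]}}
sample = {"be": {"a.txt": [1, 5], "b.txt": [0]}}
path = os.path.join(tempfile.mkdtemp(), "index.marshal")
save_index(sample, path)
restored = load_index(path)
```

Keep in mind that the marshal format is specific to the Python version that wrote it, so this suits a cache you can rebuild better than a long-term storage format.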

John La Rooy answered Sep 21 '22 19:09
