I'm writing an inverted index for a search engine on a collection of documents. Right now, I'm storing the index as a dictionary of dictionaries. That is, each keyword maps to a dictionary of docIDs->positions of occurrence.
The data model looks something like: {word : { doc_name : [location_list] } }
Building the index in memory works fine, but when I try to serialize to disk, I hit a MemoryError. Here's my code:
# Write the index out to disk
serializedIndex = open(sys.argv[3], 'wb')
cPickle.dump(index, serializedIndex, cPickle.HIGHEST_PROTOCOL)
Right before serialization, my program is using about 50% memory (1.6 Gb). As soon as I make the call to cPickle, my memory usage skyrockets to 80% before crashing.
Why is cPickle using so much memory for serialization? Is there a better way to be approaching this problem?
cPickle needs to use a bunch of extra memory because it does cycle detection. You could try using the marshal module if you are sure your data has no cycles
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With