I am processing some data and have stored the results in three dictionaries, which I saved to disk with pickle. Each dictionary is 500-1000 MB.
Now I am loading them with:
import pickle
with open('dict1.txt', "rb") as myFile:
    dict1 = pickle.load(myFile)
However, already while loading the first dictionary I get:
*** set a breakpoint in malloc_error_break to debug
python(3716,0xa08ed1d4) malloc: *** mach_vm_map(size=1048576) failed (error code=3)
*** error: can't allocate region securely
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1019, in load_empty_dictionary
    self.stack.append({})
MemoryError
How can I solve this? My computer has 16 GB of RAM, so I find it unusual that loading an 800 MB dictionary crashes. I also find it unusual that there were no problems while saving the dictionaries.
Further, in the future I plan to process more data, resulting in larger dictionaries (3-4 GB on disk), so any advice on how to improve efficiency is appreciated.
“Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.
To use pickle, start by importing it. To pickle a dictionary, first choose the name of the file you will write it to, for example dogs. Note that the file does not need an extension. Then open the file for writing in binary mode with the open() function.
The pickle.dump() function stores the object data in the file. Its first argument is the object that you want to store. The second argument is the file object you get by opening the desired file in write-binary (wb) mode. An optional third argument selects the pickle protocol.
The Python pickle module is used for serializing and de-serializing a Python object structure. Most Python objects can be pickled so that they can be saved on disk. Pickle "serializes" the object before writing it to the file: pickling converts a Python object (list, dict, etc.) into a byte stream.
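The dump and load steps described above can be sketched as a minimal round trip. The sample dictionary and the file name results.pkl are placeholders invented for illustration, not names from the question.

```python
import os
import pickle
import tempfile

# Hypothetical sample data standing in for a real results dictionary.
results = {"alpha": [1, 2, 3], "beta": {"nested": True}}

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "results.pkl")

# Write in binary mode; HIGHEST_PROTOCOL selects the newest binary
# pickle format that this interpreter supports.
with open(path, "wb") as fh:
    pickle.dump(results, fh, protocol=pickle.HIGHEST_PROTOCOL)

# Read it back, again in binary mode.
with open(path, "rb") as fh:
    restored = pickle.load(fh)
```

Passing an explicit binary protocol usually gives smaller files than the default text protocol of Python 2, though how much it helps depends on the data.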
If the data in your dictionaries are numpy arrays, packages such as joblib and klepto make pickling large arrays efficient, since both understand how to use a minimal state representation for a numpy.array. If you don't have array data, my suggestion would be to use klepto to store the dictionary entries in several files (instead of a single file) or in a database.
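To illustrate the idea of storing entries in several files, here is a hand-rolled sketch that uses only the standard library. It is not klepto's API or implementation; the helper names (save_entries, load_entry, iter_entries) are hypothetical, the code assumes Python 3, and it assumes that each key has a stable repr().

```python
import hashlib
import os
import pickle
import tempfile


def _entry_path(directory, key):
    # Hash the key's repr so that any key maps to a safe file name.
    digest = hashlib.sha1(repr(key).encode("utf-8")).hexdigest()
    return os.path.join(directory, digest + ".pkl")


def save_entries(data, directory):
    """Write each (key, value) pair of `data` to its own pickle file."""
    os.makedirs(directory, exist_ok=True)
    for key, value in data.items():
        with open(_entry_path(directory, key), "wb") as fh:
            pickle.dump((key, value), fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_entry(directory, key):
    """Load a single value without reading the other entries."""
    with open(_entry_path(directory, key), "rb") as fh:
        _stored_key, value = pickle.load(fh)
    return value


def iter_entries(directory):
    """Yield (key, value) pairs, reading one file at a time."""
    for name in sorted(os.listdir(directory)):
        if name.endswith(".pkl"):
            with open(os.path.join(directory, name), "rb") as fh:
                yield pickle.load(fh)


# Hypothetical sample data written to a temporary directory.
store_dir = tempfile.mkdtemp()
sample = {"a": list(range(5)), "b": "text", 3: {"x": 1.5}}
save_entries(sample, store_dir)
```

With this layout, load_entry(store_dir, "b") reads only one small file, and iter_entries lets you process the data without holding the whole dictionary in memory at once.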
See my answer to a very closely related question, https://stackoverflow.com/a/25244747/2379433, if you are OK with pickling to several files instead of a single file, would like to save and load your data in parallel, or would like to experiment easily with storage formats and backends to see which works best for your case. Also see https://stackoverflow.com/a/21948720/2379433 for other potential improvements, and https://stackoverflow.com/a/24471659/2379433.
As the links above discuss, you could use klepto, which lets you store dictionaries to disk or to a database through a common API. klepto also lets you pick a storage format (pickle, json, etc.). HDF5 (or a SQL database) is another good option, as it allows parallel access. klepto can use both specialized pickle formats (like numpy's) and compression (if you care about size rather than the speed of accessing the data).
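As a rough illustration of the SQL database option, the sketch below keeps one pickled value per row in a SQLite file using the standard library's sqlite3 module. It is not how klepto talks to a database; the function names (open_store, put, get), the table layout, and the assumption that keys are strings are all choices made for this example, which targets Python 3.

```python
import os
import pickle
import sqlite3
import tempfile


def open_store(path):
    """Open (or create) a SQLite file with a simple key/value table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries (key TEXT PRIMARY KEY, value BLOB)"
    )
    return conn


def put(conn, key, value):
    """Pickle `value` and store it under the string `key`."""
    blob = pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)
    conn.execute(
        "INSERT OR REPLACE INTO entries (key, value) VALUES (?, ?)", (key, blob)
    )
    conn.commit()


def get(conn, key):
    """Fetch and unpickle a single value; raise KeyError if it is missing."""
    row = conn.execute(
        "SELECT value FROM entries WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    return pickle.loads(row[0])


# Hypothetical database file and entry, for illustration only.
db_path = os.path.join(tempfile.mkdtemp(), "store.sqlite")
conn = open_store(db_path)
put(conn, "dict1/row42", {"score": 0.5})
```

Only the requested row is read from disk on each get call, so memory use stays tied to the size of one entry rather than the whole dictionary.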
klepto gives you the option to store the dictionary in an "all-in-one" file or in "one-entry-per" files, and it can also leverage multiprocessing or multithreading, meaning that you can save and load dictionary items to and from the backend in parallel. For examples, see the above links.