MemoryError with Pickle in Python

Tags:

I am processing some data and I have stored the results in three dictionaries, and I have saved them to the disk with Pickle. Each dictionary has 500-1000MB.

Now I am loading them with:

import pickle
with open('dict1.txt', "rb") as myFile:
    dict1 = pickle.load(myFile)

However, already at loading the first dictionary I get:

*** set a breakpoint in malloc_error_break to debug
python(3716,0xa08ed1d4) malloc: *** mach_vm_map(size=1048576) failed (error code=3)
*** error: can't allocate region securely
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1019, in load_empty_dictionary
    self.stack.append({})
MemoryError

How to solve this? My computer has 16GB of RAM so I find it unusual that loading a 800MB dictionary crashes. What I also find unusual is that there were no problems while saving the dictionaries.

Further, in future I plan to process more data resulting in larger dictionaries (3-4GB on the disk), so any advice how to improve the efficiency is appreciated.

720

asked Jan 21 '15 13:01

flotr

1 Answers

If your data in the dictionaries are numpy arrays, there are packages (such as joblib and klepto) that make pickling large arrays efficient, as both the klepto and joblib understand how to use minimal state representation for a numpy.array. If you don't have array data, my suggestion would be to use klepto to store the dictionary entries in several files (instead of a single file) or to a database.

See my answer to a very closely related question https://stackoverflow.com/a/25244747/2379433, if you are ok with pickling to several files instead of a single file, would like to save/load your data in parallel, or would like to easily experiment with a storage format and backend to see which works best for your case. Also see: https://stackoverflow.com/a/21948720/2379433 for other potential improvements, and here too: https://stackoverflow.com/a/24471659/2379433.

As the links above discuss, you could use klepto -- which provides you with the ability to easily store dictionaries to disk or database, using a common API. klepto also enables you to pick a storage format (pickle, json, etc.) --also HDF5 (or a SQL database) is another good option as it allows parallel access. klepto can utilize both specialized pickle formats (like numpy's) and compression (if you care about size and not speed of accessing the data).

klepto gives you the option to store the dictionary with "all-in-one" file or "one-entry-per" file, and also can leverage multiprocessing or multithreading -- meaning that you can save and load dictionary items to/from the backend in parallel. For examples, see the above links.

133

answered Sep 30 '22 04:09

Mike McKerns

Related questions
                            
                                How to call a Python function from Lua?
                            
                                Django haystack EdgeNgramField given different results than elasticsearch
                            
                                GeoDjango LayerMapping & Foreign Key
                            
                                Scikit-learn χ² (chi-squared) statistic and corresponding contingency table
                            
                                Python: How to check if path is a subpath [duplicate]
                            
                                += with multiple variables in python [closed]
                            
                                Scapy error: no module names pcapy
                            
                                Adding row to pandas DataFrame changes dtype
                            
                                Can't solve Python argparse error 'object has no attribute'
                            
                                python: doctest my github-markdown files?
                            
                                Why does pandas return timestamps instead of datetime objects when calling pd.to_datetime()?
                            
                                Pandas DataFrame cast multiple types to columns
                            
                                Is there a way to change Python's open() default text encoding?
                            
                                R dcast equivalent in python pandas
                            
                                matplotlib scatter plot with different markers and colors
                            
                                How to print all digits of a large number in python?
                            
                                Run mod_wsgi with virtualenv or Python with version different that system default
                            
                                What kind of python magic does dir() perform with __getattr__?
                            
                                How do I block python RuntimeWarning from printing to the terminal? [duplicate]
                            
                                How to split a Python generator of tuples into 2 separate generators?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

MemoryError with Pickle in Python

Tags:

python

dictionary

memory

memory-leaks

pickle

flotr

People also ask

1 Answers

Mike McKerns

Recent Activity

Donate For Us