I'm using Python 3 and trying to pickle a dictionary of IntervalTrees that weighs in at roughly 2 to 3 GB. This is my console output:
10:39:25 - project: INFO - Checking if motifs file was generated by pickle...
10:39:25 - project: INFO - - Motifs file does not seem to have been generated by pickle, proceeding to parse...
10:39:38 - project: INFO - - Parse complete, constructing IntervalTrees...
11:04:05 - project: INFO - - IntervalTree construction complete, saving pickle file for next time.
Traceback (most recent call last):
  File "/Users/alex/Documents/project/src/project.py", line 522, in dict_of_IntervalTree_from_motifs_file
    save_as_pickled_object(motifs, output_dir + 'motifs_IntervalTree_dictionary.pickle')
  File "/Users/alex/Documents/project/src/project.py", line 269, in save_as_pickled_object
    def save_as_pickled_object(object, filepath): return pickle.dump(object, open(filepath, "wb"))
OSError: [Errno 22] Invalid argument
The line in which I attempt the save is
def save_as_pickled_object(object, filepath): return pickle.dump(object, open(filepath, "wb"))
The error comes maybe 15 minutes after save_as_pickled_object is invoked (at 11:20).
I tried this with a much smaller subsection of the motifs file and it worked fine with the exact same code, so it must be an issue of scale. Are there any known bugs with pickle in Python 3.6 relating to the size of what you try to pickle? Are there known bugs with pickling large objects in general? Are there any known ways around this?
Thanks!
This is the code I used instead.
import os
import pickle

def save_as_pickled_object(obj, filepath):
    """
    This is a defensive way to write pickle.dump, allowing for very large files on all platforms.
    """
    max_bytes = 2**31 - 1  # macOS fails on single writes of 2 GB or more (bpo-24658)
    bytes_out = pickle.dumps(obj)
    n_bytes = len(bytes_out)  # use len(), not sys.getsizeof(): we need the actual byte count
    with open(filepath, 'wb') as f_out:
        # write the serialized bytes in chunks below the 2**31 limit
        for idx in range(0, n_bytes, max_bytes):
            f_out.write(bytes_out[idx:idx + max_bytes])
def try_to_load_as_pickled_object_or_None(filepath):
    """
    This is a defensive way to write pickle.load, allowing for very large files on all platforms.
    """
    max_bytes = 2**31 - 1  # read in chunks below 2 GB for the same macOS limitation
    try:
        input_size = os.path.getsize(filepath)
        bytes_in = bytearray(0)
        with open(filepath, 'rb') as f_in:
            for _ in range(0, input_size, max_bytes):
                bytes_in += f_in.read(max_bytes)
        obj = pickle.loads(bytes_in)
    except Exception:  # a bare except would also swallow KeyboardInterrupt and SystemExit
        return None
    return obj
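For example, here's a minimal usage sketch (a plain dict stands in for the actual dictionary of IntervalTrees):

data = {"chr1": list(range(10))}  # stand-in for the real dictionary of IntervalTrees
save_as_pickled_object(data, "motifs_IntervalTree_dictionary.pickle")

restored = try_to_load_as_pickled_object_or_None("motifs_IntervalTree_dictionary.pickle")
assert restored == data  # returns None instead if the file is missing or unreadable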
“Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.
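For instance, a minimal round trip through a byte stream:

import pickle

tree = {"root": {"left": 1, "right": 2}}
blob = pickle.dumps(tree)        # pickling: object hierarchy -> byte stream
restored = pickle.loads(blob)    # unpickling: byte stream -> object hierarchy
assert restored == tree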
The insecurity is not because pickles contain code, but because they create objects by calling constructors named in the pickle. Any callable can be used in place of your class name to construct objects. Malicious pickles will use other Python callables as the “constructors.” For example, instead of executing “models.Group(name='foo')”, a dangerous pickle might execute “os.system('rm -rf /')”.
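Here's a minimal sketch of that mechanism, with a harmless echo standing in for a dangerous command:

import os
import pickle

class Innocent:
    def __reduce__(self):
        # Whatever callable and arguments this returns, unpickling will invoke.
        return (os.system, ("echo this ran during unpickling",))

payload = pickle.dumps(Innocent())
pickle.loads(payload)  # runs os.system(...) as the "constructor"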
pickle.dump replaces the current file data: opening the file in "wb" mode truncates it, so each dump overwrites whatever was there before rather than appending.
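A small sketch of that behavior:

import pickle

with open("data.pickle", "wb") as f:   # "wb" truncates the file on open
    pickle.dump({"first": 1}, f)
with open("data.pickle", "wb") as f:
    pickle.dump({"second": 2}, f)      # the first dict is gone now

with open("data.pickle", "rb") as f:
    print(pickle.load(f))              # {'second': 2}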
The advantage of using pickle is that it can serialize pretty much any Python object without requiring any extra code. It's also smart in that it will only write out any single object once, which makes it possible to store recursive structures like graphs efficiently.
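For example, shared and recursive references survive a round trip intact:

import pickle

node = {"name": "shared"}
node["self"] = node                    # a recursive structure
graph = {"a": node, "b": node}         # two references to the same object
restored = pickle.loads(pickle.dumps(graph))
assert restored["a"] is restored["b"]          # written out only once
assert restored["a"]["self"] is restored["a"]  # recursion preserved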
Alex, if I am not mistaken, this bug report perfectly describes your issue:
http://bugs.python.org/issue24658
As a workaround, I think you can use pickle.dumps instead of pickle.dump, and then write the resulting bytes to your file in chunks of size smaller than 2**31.