Does pickle randomly fail with OSError on large files?

Problem Statement

I'm using Python 3 and trying to pickle a dictionary of IntervalTrees that takes up roughly 2 to 3 GB. This is my console output:

10:39:25 - project: INFO - Checking if motifs file was generated by pickle...
10:39:25 - project: INFO -   - Motifs file does not seem to have been generated by pickle, proceeding to parse...
10:39:38 - project: INFO -   - Parse complete, constructing IntervalTrees...
11:04:05 - project: INFO -   - IntervalTree construction complete, saving pickle file for next time.
Traceback (most recent call last):
  File "/Users/alex/Documents/project/src/project.py", line 522, in dict_of_IntervalTree_from_motifs_file
    save_as_pickled_object(motifs, output_dir + 'motifs_IntervalTree_dictionary.pickle')
  File "/Users/alex/Documents/project/src/project.py", line 269, in save_as_pickled_object
    def save_as_pickled_object(object, filepath): return pickle.dump(object, open(filepath, "wb"))
OSError: [Errno 22] Invalid argument

The line in which I attempt the save is

def save_as_pickled_object(object, filepath): return pickle.dump(object, open(filepath, "wb"))

The error comes maybe 15 minutes after save_as_pickled_object is invoked (at 11:20).

I tried this with a much smaller subsection of the motifs file and it worked fine with exactly the same code, so it must be an issue of scale. Are there any known bugs with pickle in Python 3.6 relating to the size of what you try to pickle? Are there known bugs with pickling large files in general? Are there any known ways around this?

Thanks!

Update: This question might be a duplicate of "Python 3 - Can pickle handle byte objects larger than 4GB?"

Solution

This is the code I used instead.

import os
import pickle


def save_as_pickled_object(obj, filepath):
    """
    This is a defensive way to write pickle.dump, allowing for very large files on all platforms
    """
    max_bytes = 2**31 - 1
    bytes_out = pickle.dumps(obj)
    # len() gives the payload size; sys.getsizeof() would also count object overhead
    n_bytes = len(bytes_out)
    with open(filepath, 'wb') as f_out:
        # write in chunks that each stay below 2**31 bytes
        for idx in range(0, n_bytes, max_bytes):
            f_out.write(bytes_out[idx:idx+max_bytes])


def try_to_load_as_pickled_object_or_None(filepath):
    """
    This is a defensive way to write pickle.load, allowing for very large files on all platforms
    """
    max_bytes = 2**31 - 1
    try:
        input_size = os.path.getsize(filepath)
        bytes_in = bytearray(0)
        with open(filepath, 'rb') as f_in:
            # read in chunks that each stay below 2**31 bytes
            for _ in range(0, input_size, max_bytes):
                bytes_in += f_in.read(max_bytes)
        obj = pickle.loads(bytes_in)
    except Exception:
        # missing file, unreadable file, or data that is not a valid pickle
        return None
    return obj
Alex Lenail asked Mar 07 '17 16:03



1 Answer

Alex, if I am not mistaken this bug report perfectly describes your issue.

http://bugs.python.org/issue24658

As a workaround, I think you can use pickle.dumps instead of pickle.dump and then write the resulting bytes to your file in chunks smaller than 2**31 bytes.
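
A variant of the same workaround, offered only as a sketch: instead of building the whole byte string with pickle.dumps first, a thin wrapper around the file object can split every read and write into pieces below the limit, so pickle.dump and pickle.load can still be called directly. The names ChunkedFile, dump_chunked and load_chunked are hypothetical and do not come from pickle or from the bug report; the chunk size is a parameter only so that the behaviour can be tried on small data.

```python
import os
import pickle
import tempfile

MAX_CHUNK = 2**31 - 1  # largest single read/write this sketch will issue


class ChunkedFile:
    """Hypothetical wrapper: forwards reads and writes in bounded chunks."""

    def __init__(self, raw, chunk_size=MAX_CHUNK):
        self.raw = raw
        self.chunk_size = chunk_size

    def write(self, data):
        # memoryview slices avoid copying the (possibly huge) buffer
        view = memoryview(data)
        for start in range(0, len(view), self.chunk_size):
            self.raw.write(view[start:start + self.chunk_size])
        return len(view)

    def read(self, size):
        # collect up to `size` bytes, never asking for more than chunk_size at once
        parts = []
        remaining = size
        while remaining > 0:
            part = self.raw.read(min(remaining, self.chunk_size))
            if not part:
                break  # end of file
            parts.append(part)
            remaining -= len(part)
        return b"".join(parts)

    def readline(self):
        # pickle.load requires readline(); lines in a pickle stream are short
        return self.raw.readline()


def dump_chunked(obj, filepath, chunk_size=MAX_CHUNK):
    with open(filepath, "wb") as f_out:
        pickle.dump(obj, ChunkedFile(f_out, chunk_size))


def load_chunked(filepath, chunk_size=MAX_CHUNK):
    with open(filepath, "rb") as f_in:
        return pickle.load(ChunkedFile(f_in, chunk_size))


# Small round trip with a deliberately tiny chunk size to exercise the splitting.
sample = {"chr1": list(range(1000)), "blob": b"x" * 10_000}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.pickle")
    dump_chunked(sample, path, chunk_size=64)
    restored = load_chunked(path, chunk_size=64)
assert restored == sample
```

With the default chunk_size the wrapper behaves like a plain file for small objects and only splits operations that would otherwise exceed 2**31 - 1 bytes. Whether this uses less memory than the pickle.dumps approach depends on how the Python version in use buffers its output, so treat that as untested.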

Giannis Spiliopoulos answered Sep 28 '22 08:09
