Speed up reading multiple pickle files

I have a lot of pickle files. Currently I read them in a loop but it takes a lot of time. I would like to speed it up but don't have any idea how to do that.
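For reference, the sequential loop looks roughly like this (a minimal sketch; the directory name and file pattern are made up):

import pickle
from pathlib import Path

# read every pickle file one after another (hypothetical layout)
objects = []
for path in Path("data").glob("*.pickle"):
    with open(path, "rb") as f:
        objects.append(pickle.load(f))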

Multiprocessing wouldn't work because in order to transfer data from a child subprocess to the main process data need to be serialized (pickled) and deserialized.
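To illustrate: a naive multiprocessing version (sketch below, with hypothetical file names) just moves pickle.load into the workers, but every result is pickled again to be sent back to the parent process, so the total (de)serialization work stays the same.

import pickle
from multiprocessing import Pool

def load_one(path):
    with open(path, "rb") as f:
        return pickle.load(f)  # deserialized in the worker...

if __name__ == "__main__":
    paths = ["part1.pickle", "part2.pickle"]  # hypothetical file names
    with Pool() as pool:
        # ...but Pool re-pickles each result to ship it back to the parent
        objects = pool.map(load_one, paths)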

Using threading wouldn't help either because of GIL.

I think that the solution would be some library written in C that takes a list of files to read and then runs multiple threads (without GIL). Is there something like this around?

UPDATE Answering your questions:

  • The files are partial products of data processing for the purpose of ML
  • They contain pandas.Series objects, but the dtype is not known upfront
  • I want to have many files so that we can pick any subset easily
  • I want many smaller files instead of one big file because deserializing one big file takes more memory (at some point we hold both the serialized string and the deserialized objects)
  • The size of the files can vary a lot
  • I use Python 3.7, so I believe it's cPickle in fact
  • Using pickle is very flexible because I don't have to worry about the underlying types - I can save anything
asked Feb 24 '21 by user2146414


2 Answers

I agree with what has been noted in the comments, namely that due to the constraints of Python itself (chiefly the GIL, as you noted), there may simply be no faster way to load the information than what you are doing now. And if there is a way, it may be both highly technical and, in the end, only give you a modest increase in speed.

That said, depending on the datatypes you have, it may be faster to use quickle or pyrobuf.

answered Sep 29 '22 by hrokr


I think that the solution would be some library written in C that takes a list of files to read and then runs multiple threads (without GIL). Is there something like this around?

In short: no. pickle is apparently good enough for enough people that there are no major alternative implementations fully compatible with the pickle protocol. As of Python 3, cPickle was merged into pickle, and neither releases the GIL anyway, which is why threading won't help you (search for Py_BEGIN_ALLOW_THREADS in _pickle.c and you will find nothing).

If your data can be re-structured into a simpler format like CSV, or a binary format like numpy's npy, there will be less CPU overhead when reading it. Pickle is built for flexibility first, rather than speed or compactness. One possible exception to the rule that more complexity means less speed is the HDF5 format via h5py, which can be fairly complex and which I have used to max out the bandwidth of a SATA SSD.
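For example, a numeric series can be stored as a raw array instead of a pickle (a minimal sketch, assuming a plain float dtype; note that the .npy route keeps only the values, not the index or name):

import numpy as np
import pandas as pd
import h5py

s = pd.Series([1.0, 2.0, 3.0], name="feature_a")

# .npy: one array per file, values only
np.save("feature_a.npy", s.to_numpy())
restored = pd.Series(np.load("feature_a.npy"), name="feature_a")

# HDF5 via h5py: many arrays in a single file
with h5py.File("features.h5", "w") as f:
    f.create_dataset("feature_a", data=s.to_numpy())
with h5py.File("features.h5", "r") as f:
    values = f["feature_a"][:]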

Finally, you mention that you have many, many pickle files, and that itself is probably causing no small amount of overhead: each time you open a new file there is some cost from the operating system. Conveniently, you can combine pickle files by simply appending them together; you can then call Unpickler.load() until you reach the end of the file. Here's a quick example of combining two pickle files using shutil:

import pickle, shutil, os

#some dummy data
d1 = {'a': 1, 'b': 2, 1: 'a', 2: 'b'}
d2 = {'c': 3, 'd': 4, 3: 'c', 4: 'd'}

#create two pickles
with open('test1.pickle', 'wb') as f:
    pickle.Pickler(f).dump(d1)
with open('test2.pickle', 'wb') as f:
    pickle.Pickler(f).dump(d2)
    
#combine list of pickle files
with open('test3.pickle', 'wb') as dst:
    for pickle_file in ['test1.pickle', 'test2.pickle']:
        with open(pickle_file, 'rb') as src:
            shutil.copyfileobj(src, dst)
            
#unpack the data
with open('test3.pickle', 'rb') as f:
    p = pickle.Unpickler(f)
    while True:
        try:
            print(p.load())
        except EOFError:
            break
        
#cleanup
os.remove('test1.pickle')
os.remove('test2.pickle')
os.remove('test3.pickle')
answered Sep 29 '22 by Aaron