Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's the fastest way to pickle a pandas DataFrame?

Which is better, using Pandas built-in method or pickle.dump?

The standard pickle method looks like this:

pickle.dump(my_dataframe, open('test_pickle.p', 'wb'))

The Pandas built-in method looks like this:

my_dataframe.to_pickle('test_pickle.p')
like image 656
tegan Avatar asked Feb 26 '15 23:02

tegan


People also ask

Can I pickle a Pandas DataFrame?

Pandas DataFrame: to_pickle() functionThe to_pickle() function is used to pickle (serialize) object to file. File path where the pickled object will be stored. A string representing the compression to use in the output file.

Is PyArrow faster than Pandas?

There's a better way. It's called PyArrow — an amazing Python binding for the Apache Arrow project. It introduces faster data read/write times and doesn't otherwise interfere with your data analysis pipeline. It's the best of both worlds, as you can still use Pandas for further calculations.

Is pickle more efficient than CSV?

Pickle: Pickle is the native format of python that is popular for object serialization. The advantage of pickle is that it allows the python code to implement any type of enhancements. It is much faster when compared to CSV files and reduces the file size to almost half of CSV files using its compression techniques.


1 Answers

Thanks to @qwwqwwq I discovered that pandas has a built-in to_pickle method for dataframes. I did a quick time test:

In [1]: %timeit pickle.dump(df, open('test_pickle.p', 'wb'))
10 loops, best of 3: 91.8 ms per loop

In [2]: %timeit df.to_pickle('testpickle.p')
10 loops, best of 3: 88 ms per loop

So it seems that the built-in is only narrowly better (to me, this is useful because it means it's probably not worth refactoring code to use the built-in) - hope this helps someone!

like image 168
tegan Avatar answered Oct 10 '22 12:10

tegan