
Fastest file format for read/write operations with Pandas and/or Numpy [closed]

I've been working for a while with very large DataFrames, and I've been using the CSV format to store input data and results. I've noticed that a lot of time goes into reading and writing these files, which, for example, dramatically slows down batch processing. I was wondering whether the file format itself matters: is there a preferred file format for faster reading/writing of Pandas DataFrames and/or Numpy arrays?

asked Apr 08 '14 by c_david


3 Answers

Use HDF5. It beats writing flat files hands down, and you can query it. Docs are here.

Here's a performance comparison vs SQL, updated to show SQL/HDF_fixed/HDF_table/CSV write and read performance.

The docs now include a performance section: see here.
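
For illustration, a minimal sketch of the two HDF5 flavours covered in the docs (the file name, keys and example frame are placeholders; pandas' HDF5 support requires the PyTables package):

import pandas as pd

df = pd.DataFrame({"a": range(1000), "b": [0.5] * 1000})

# 'fixed' format: fastest to write and read, but not queryable.
df.to_hdf("store.h5", key="fixed_df", mode="w", format="fixed")

# 'table' format: slower, but supports on-disk queries via `where`.
df.to_hdf("store.h5", key="table_df", format="table", data_columns=["a"])
subset = pd.read_hdf("store.h5", "table_df", where="a > 990")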

answered by Jeff


Recently, pandas added support for the Parquet format, using the pyarrow library as a backend (written by Wes McKinney himself, with his usual obsession for performance).

You only need to install the pyarrow library and use the read_parquet and to_parquet methods. Parquet is much faster to read and write for bigger datasets (above a few hundred megabytes or so), and it also keeps track of dtype metadata, so you won't lose data type information when writing to and reading from disk. It can actually store some datatypes more efficiently than HDF5 does (strings and timestamps, for example: HDF5 doesn't have a native data type for those, so it uses pickle to serialize them, which is slow for big datasets).

Parquet is also a columnar format, which makes it very easy to do two things:

  • Quickly discard columns that you're not interested in. With CSV you have to read the whole file, and only after that can you throw away columns you don't want. With Parquet you can read only the columns you're interested in (see the sketch after this list).

  • Run queries that filter out rows and read only the data you care about.
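
A minimal sketch of that round trip (the file name, column names and example data are placeholders; pyarrow needs to be installed for the engine used below):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3], "c": ["x", "y", "z"]})

# Write the whole frame to a Parquet file using pyarrow as the engine.
df.to_parquet("data.parquet", engine="pyarrow")

# Read back only the columns of interest; the rest of the file is skipped.
subset = pd.read_parquet("data.parquet", columns=["a", "c"], engine="pyarrow")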

Another interesting recent development is the Feather file format, also developed by Wes McKinney. It's essentially just the uncompressed Arrow format written directly to disk, so it is potentially faster to write than Parquet. The disadvantage is that the files end up 2-3x larger.
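
A sketch of the corresponding Feather round trip (also backed by pyarrow; note that, at least in current pandas versions, to_feather expects a default integer index, so reset a custom index first):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Feather writes the uncompressed Arrow representation straight to disk.
df.to_feather("data.feather")
df2 = pd.read_feather("data.feather")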

answered by Rafael S. Calsaverini


It's always a good idea to run some benchmarks for your use case. I've had good results storing raw structs via numpy:

# mytype is a numpy structured dtype describing one record (see the sketch below)
df.to_records().astype(mytype).tofile('mydata')
df = pd.DataFrame.from_records(np.fromfile('mydata', dtype=mytype))

It is quite fast and takes up less space on disk. But: you'll need to keep track of the dtype to reload the data, it's not portable between architectures, and it doesn't support the advanced features of HDF5. (numpy has a more advanced binary format, the .npy format written by numpy.save, which is designed to overcome the first two limitations, but I haven't had much success getting it to work.)
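
A self-contained sketch of that round trip (the frame, column names and dtype here are hypothetical; in practice mytype has to match your columns and string widths):

import numpy as np
import pandas as pd

# Hypothetical frame: a short string index plus a few float columns.
df = pd.DataFrame(
    np.random.rand(1000, 3),
    columns=["x", "y", "z"],
    index=["row%05d" % i for i in range(1000)],
)

# The record dtype is written out by hand (fixed-width bytes for the index,
# float64 for each value column) and must be kept around to reload the file.
mytype = np.dtype([("index", "S8"), ("x", "f8"), ("y", "f8"), ("z", "f8")])

df.to_records().astype(mytype).tofile("mydata")

# Reload; the index comes back as a bytes column named 'index'.
df2 = pd.DataFrame.from_records(np.fromfile("mydata", dtype=mytype))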

Update: Thanks for pressing me for numbers. My benchmark indicates that HDF5 does indeed win, at least in my case; it's both faster and smaller on disk. Here's what I see with a dataframe of about 280k rows, 7 float columns, and a string index:

In [15]: %timeit df.to_hdf('test_fixed.hdf', 'test', mode='w')
10 loops, best of 3: 172 ms per loop
In [17]: %timeit df.to_records().astype(mytype).tofile('raw_data')
1 loops, best of 3: 283 ms per loop
In [20]: %timeit pd.read_hdf('test_fixed.hdf', 'test')
10 loops, best of 3: 36.9 ms per loop
In [22]: %timeit pd.DataFrame.from_records(np.fromfile('raw_data', dtype=mytype))
10 loops, best of 3: 40.7 ms per loop
In [23]: ls -l raw_data test_fixed.hdf
-rw-r----- 1 altaurog altaurog 18167232 Apr  8 12:42 raw_data
-rw-r----- 1 altaurog altaurog 15537704 Apr  8 12:41 test_fixed.hdf
answered by Aryeh Leib Taurog