 

Why do pandas and dask perform better when importing from CSV compared to HDF5?

Tags: python, hdf5, dask

I am working with a system that currently operates on large (>5 GB) .csv files. To increase performance, I am testing (A) different methods of creating dataframes from disk (pandas vs. dask) as well as (B) different ways of storing results to disk (.csv vs. HDF5 files).

In order to benchmark performance, I did the following:

import gc

import pandas as pd
import dask.dataframe as dd

# results_path points to the large .csv written earlier (defined elsewhere);
# `hdf` is an HDFStore handle for 'store.h5' opened earlier in the session.

def dask_read_from_hdf():
    results_dd_hdf = dd.read_hdf('store.h5', key='period1', columns=['Security'])
    analyzed_stocks_dd_hdf = results_dd_hdf.Security.unique()
    hdf.close()

def pandas_read_from_hdf():
    results_pd_hdf = pd.read_hdf('store.h5', key='period1', columns=['Security'])
    analyzed_stocks_pd_hdf = results_pd_hdf.Security.unique()
    hdf.close()

def dask_read_from_csv():
    results_dd_csv = dd.read_csv(results_path, sep=",", usecols=[0], header=1, names=["Security"])
    analyzed_stocks_dd_csv = results_dd_csv.Security.unique()

def pandas_read_from_csv():
    results_pd_csv = pd.read_csv(results_path, sep=",", usecols=[0], header=1, names=["Security"])
    analyzed_stocks_pd_csv = results_pd_csv.Security.unique()

print "dask hdf performance"
%timeit dask_read_from_hdf()
gc.collect()
print ""
print "pandas hdf performance"
%timeit pandas_read_from_hdf()
gc.collect()
print ""
print "dask csv performance"
%timeit dask_read_from_csv()
gc.collect()
print ""
print "pandas csv performance"
%timeit pandas_read_from_csv()
gc.collect()

My findings are:

dask hdf performance
10 loops, best of 3: 133 ms per loop

pandas hdf performance
1 loop, best of 3: 1.42 s per loop

dask csv performance
1 loop, best of 3: 7.88 ms per loop

pandas csv performance
1 loop, best of 3: 827 ms per loop

If HDF5 storage can be accessed faster than .csv, and if dask creates dataframes faster than pandas, why is reading from HDF5 with dask slower than reading from CSV with dask? Am I doing something wrong?

When does it make sense, performance-wise, to create dask dataframes from HDF5 stores?

asked Sep 01 '17 by sudonym



1 Answer

HDF5 is most efficient when working with numerical data. I'm guessing you are reading a single string column, which is its weak point.

Performance of string data with HDF5 can be dramatically improved by using a Categorical to store your strings, assuming relatively low cardinality (a high number of repeated values).
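
As a minimal, hypothetical sketch (made-up tickers and a separate file name, not the poster's actual data), storing the column as a Categorical before writing looks like this:

import pandas as pd

# Low-cardinality string column stored as a Categorical: HDF5 then keeps
# integer codes plus a small lookup table instead of repeating each string.
df = pd.DataFrame({"Security": ["AAPL", "MSFT", "AAPL", "GOOG"] * 1000})
df["Security"] = df["Security"].astype("category")

# Categoricals require the table format in HDFStore.
df.to_hdf("store_cat.h5", key="period1", format="table")

roundtrip = pd.read_hdf("store_cat.h5", key="period1", columns=["Security"])
print(roundtrip["Security"].dtype)  # category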

It's from a little while back, but there is a good blog post going through exactly these considerations: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization

You may also look at using parquet - it is similar to HDF5 in that it is a binary format, but it is column-oriented, so a single-column selection like this will likely be faster.

Recently (2016-2017) there has been significant work on a fast native parquet-to-pandas reader, and the next major release of pandas (0.21) will have to_parquet and pd.read_parquet functions built in.
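
For illustration, a minimal sketch of the pandas round trip (hypothetical file name and data; assumes pandas >= 0.21 with pyarrow or fastparquet installed):

import pandas as pd

# Hypothetical example data, not the poster's actual results file.
df = pd.DataFrame({"Security": ["AAPL", "MSFT", "GOOG"] * 1000,
                   "Price": range(3000)})
df.to_parquet("results.parquet")

# Parquet is column-oriented, so only the 'Security' column is read from disk.
securities = pd.read_parquet("results.parquet", columns=["Security"])
print(securities["Security"].unique())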

https://arrow.apache.org/docs/python/parquet.html

https://fastparquet.readthedocs.io/en/latest/

https://matthewrocklin.com/blog//work/2017/06/28/use-parquet

answered Nov 15 '22 by chrisb