I am working with a system that currently operates with large (>5GB) .csv files. To increase performance, I am testing (A) different methods to create dataframes from disk (pandas vs. dask) as well as (B) different ways to store results to disk (.csv vs. HDF5 files).
In order to benchmark performance, I did the following:
import gc

import pandas as pd
import dask.dataframe as dd

def dask_read_from_hdf():
    results_dd_hdf = dd.read_hdf('store.h5', key='period1', columns=['Security'])
    analyzed_stocks_dd_hdf = results_dd_hdf.Security.unique()

def pandas_read_from_hdf():
    results_pd_hdf = pd.read_hdf('store.h5', key='period1', columns=['Security'])
    analyzed_stocks_pd_hdf = results_pd_hdf.Security.unique()

def dask_read_from_csv():
    # results_path points to the large .csv results file
    results_dd_csv = dd.read_csv(results_path, sep=",", usecols=[0], header=1, names=["Security"])
    analyzed_stocks_dd_csv = results_dd_csv.Security.unique()

def pandas_read_from_csv():
    results_pd_csv = pd.read_csv(results_path, sep=",", usecols=[0], header=1, names=["Security"])
    analyzed_stocks_pd_csv = results_pd_csv.Security.unique()
print "dask hdf performance"
%timeit dask_read_from_hdf()
gc.collect()
print""
print "pandas hdf performance"
%timeit pandas_read_from_hdf()
gc.collect()
print""
print "dask csv performance"
%timeit dask_read_from_csv()
gc.collect()
print""
print "pandas csv performance"
%timeit pandas_read_from_csv()
gc.collect()
My findings are:
dask hdf performance
10 loops, best of 3: 133 ms per loop
pandas hdf performance
1 loop, best of 3: 1.42 s per loop
dask csv performance
1 loop, best of 3: 7.88 ms per loop
pandas csv performance
1 loop, best of 3: 827 ms per loop
If HDF5 storage can be accessed faster than .csv, and if dask creates dataframes faster than pandas, why is dask from HDF5 slower than dask from CSV? Am I doing something wrong?
When does it make sense, performance-wise, to create dask dataframes from HDF5 storage objects?
Pandas uses in-memory computation, which makes it ideal for small to medium-sized datasets; its ability to process big datasets is limited by out-of-memory errors.
The default pandas data types are also not the most memory efficient. This is especially true for text columns with relatively few unique values (commonly referred to as "low-cardinality" data). By using more efficient data types, you can store larger datasets in memory.
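As a rough, hypothetical illustration of that point (the column name and sample tickers below are made up, not taken from the question's data), converting a repetitive string column to pandas' category dtype shrinks its in-memory footprint considerably:

import numpy as np
import pandas as pd

# Hypothetical low-cardinality string column: a million rows, only a handful
# of distinct ticker values.
tickers = np.random.choice(['AAPL', 'MSFT', 'GOOG', 'AMZN'], size=1000000)
df = pd.DataFrame({'Security': tickers})

object_bytes = df['Security'].memory_usage(deep=True)
category_bytes = df['Security'].astype('category').memory_usage(deep=True)

# A category column stores each distinct string once plus compact integer
# codes, so it is typically far smaller than the object (string) column.
print(object_bytes, category_bytes)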
HDF5 is most efficient when working with numerical data; I'm guessing you are reading a single string column, which is its weak point.
Performance of string data with HDF5 can be dramatically improved by using a Categorical to store your strings, assuming relatively low cardinality (i.e. a high number of repeated values).
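A minimal sketch of that suggestion, reusing the store.h5 file and period1 key from the question (the tiny example DataFrame is made up): convert the string column to a Categorical and write it in table format, which is what allows both categoricals and single-column selection on read.

import pandas as pd

# Stand-in for the real results DataFrame from the question.
results = pd.DataFrame({'Security': ['AAPL', 'MSFT', 'AAPL', 'GOOG']})
results['Security'] = results['Security'].astype('category')

# format='table' is required for storing categoricals and for selecting
# individual columns when reading back.
results.to_hdf('store.h5', key='period1', format='table', mode='w')

# Reading just the one column now moves integer category codes plus a small
# lookup table instead of one Python string object per row.
securities = pd.read_hdf('store.h5', key='period1', columns=['Security'])
print(securities['Security'].unique())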
It's from a little while back, but there's a good blog post going through exactly these considerations: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization
You may also look at using parquet - it is similar to HDF5 in that it is a binary format, but is column oriented, so a single column selection like this will likely be faster.
Recently (2016-2017) there has been significant work to implement a fast native parquet -> pandas reader, and the next major release of pandas (0.21) will have to_parquet and pd.read_parquet functions built in.
https://arrow.apache.org/docs/python/parquet.html
https://fastparquet.readthedocs.io/en/latest/
https://matthewrocklin.com/blog//work/2017/06/28/use-parquet
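A sketch of the parquet route under the same single-column access pattern (the file name results.parquet is hypothetical, and it assumes pandas >= 0.21 with pyarrow or fastparquet installed, plus a recent dask):

import pandas as pd
import dask.dataframe as dd

# Write a stand-in results DataFrame to a (hypothetical) parquet file.
results = pd.DataFrame({'Security': ['AAPL', 'MSFT', 'AAPL', 'GOOG']})
results.to_parquet('results.parquet')

# Parquet is column-oriented, so selecting a single column only reads that
# column's data from disk.
pd_securities = pd.read_parquet('results.parquet', columns=['Security'])
dd_securities = dd.read_parquet('results.parquet', columns=['Security'])

print(pd_securities['Security'].unique())
print(dd_securities['Security'].unique().compute())  # dask is lazy; compute() triggers the read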