I have a directory of timeseries data stored as CSV files, one file per day. How do I load and process it efficiently with Dask DataFrame?
Disclaimer: I maintain Dask. This question occurs often enough in other channels that I decided to add a question here on StackOverflow to which I can point people in the future.
As a rough point of reference, a pandas query on this kind of data took 182 seconds while the equivalent Dask query took 19 seconds, roughly 10 times faster. Dask can provide performance boosts over pandas because it executes common operations in parallel, whereas pandas is limited to a single core.
read_csv(chunksize): one way to process large files with plain pandas is to read the rows in chunks of reasonable size, processing each chunk before reading the next so that only one chunk is in memory at a time. The chunksize parameter specifies the size of each chunk as a number of rows.
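A minimal sketch of that chunked approach with plain pandas; the file name, chunk size, and the 'value' column below are placeholders, not something given in the question:

import pandas as pd

# Read one day's file in chunks of 100,000 rows; only one chunk
# is held in memory at a time.
totals = []
for chunk in pd.read_csv('2000-01-01.csv', chunksize=100_000):
    # Process each chunk independently, e.g. a partial aggregation
    # over a hypothetical numeric column named 'value'.
    totals.append(chunk['value'].sum())

grand_total = sum(totals)

This works, but you have to write the out-of-core logic yourself; the Dask approach below handles that for you and adds parallelism.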
If you just want to get something working quickly, then simple use of dask.dataframe.read_csv with a glob string for the path should suffice:
import dask.dataframe as dd
df = dd.read_csv('2000-*.csv')
The dask.dataframe.read_csv function supports most of the pandas.read_csv keyword arguments, so you might want to tweak things a bit, for example parsing the timestamp column as dates:
df = dd.read_csv('2000-*.csv', parse_dates=['timestamp'])
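Since Dask is lazy, it can be worth a quick sanity check that columns and dtypes were inferred the way you expect; df.head() only reads a small sample from the first partition, so it is cheap:

df.head()    # reads a small sample from the first partition
df.dtypes    # inferred column types
# If a column is inferred incorrectly, pass dtype={...} to read_csv,
# just as you would with pandas.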
Many operations like groupbys, joins, and index lookups can be more efficient if the target column is the index. For example, if the timestamp column is made the index, then you can quickly look up the values for a particular time range, or join efficiently with another dataframe along time. The savings here can easily be 10x.
The naive way to do this is to use the set_index method:
df2 = df.set_index('timestamp')
However, if you know that your new index column is sorted, then you can make this much faster by passing the sorted=True keyword argument:
df2 = df.set_index('timestamp', sorted=True)
In the above case we still pass through the data once to find good breakpoints. However, if your data is already nicely segmented (such as one file per day), then you can give these division values to set_index yourself and avoid this initial pass, which can be costly for a large amount of CSV data:
import pandas as pd
divisions = tuple(pd.date_range(start='2000', end='2001', freq='1D'))
df2 = df.set_index('timestamp', sorted=True, divisions=divisions)
This solution correctly and cheaply sets the timestamp column as the index (allowing for efficient computations in the future).
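For example, once the timestamp index and divisions are in place, range lookups and time-based operations only touch the partitions they need. The dates and the 'value' column below are illustrative, not from the question:

# Select a time range; with known divisions Dask only reads the
# partitions (days) that overlap this slice.
march = df2.loc['2000-03-01':'2000-03-07'].compute()

# Downsample to daily means; an illustrative aggregation assuming
# a numeric column named 'value'.
daily = df2['value'].resample('1D').mean().compute()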
CSV is a pervasive and convenient format. However, it is also very slow to parse. Columnar formats like Parquet may be of interest to you; they can easily be 10x to 100x faster to load.
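A minimal sketch of that conversion (the output path is just a placeholder, and you need a Parquet engine such as pyarrow or fastparquet installed); once the data is written out, repeated analyses can read the Parquet version instead of re-parsing the CSV files:

# Write the indexed dataframe out as Parquet once...
df2.to_parquet('timeseries.parquet')

# ...then later reads are much faster than re-parsing CSV.
df3 = dd.read_parquet('timeseries.parquet')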