
Read Directory of Timeseries CSV data efficiently with Dask DataFrame and Pandas

I have a directory of timeseries data stored as CSV files, one file per day. How do I load and process it efficiently with Dask DataFrame?

Disclaimer: I maintain Dask. This question occurs often enough in other channels that I decided to add a question here on StackOverflow to which I can point people in the future.

asked Dec 24 '16 by MRocklin


1 Answer

Simple Solution

If you just want to get something working quickly, then simple use of dask.dataframe.read_csv with a glob string for the path should suffice:

import dask.dataframe as dd
df = dd.read_csv('2000-*.csv')
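Note that the read is lazy: dd.read_csv only samples the beginning of the data to infer column names and dtypes, and nothing is actually loaded until you ask for a result. A minimal sketch of that (the name and value columns here are hypothetical):

# operations build a task graph across all matching files
totals = df.groupby('name').value.sum()

# compute() triggers reading the CSVs and running the aggregation in parallel
print(totals.compute())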

Keyword arguments

The dask.dataframe.read_csv function supports most of the pandas.read_csv keyword arguments, so you can tweak the call to match your data, for example by parsing dates:

df = dd.read_csv('2000-*.csv', parse_dates=['timestamp'])
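A slightly fuller sketch, in case it helps (the value column and the particular dtype and blocksize values are illustrative assumptions, not requirements):

df = dd.read_csv(
    '2000-*.csv',
    parse_dates=['timestamp'],    # parse timestamps as datetimes while reading
    dtype={'value': 'float64'},   # pin dtypes up front to avoid inference surprises
    blocksize='64MB',             # Dask-specific: target size of each partition
)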

Set the index

Many operations, like groupbys, joins, and index lookups, can be more efficient if the target column is the index. For example, if the timestamp column is the index then you can quickly look up the values for a particular time range, or you can join efficiently with another dataframe along time. The savings here can easily be 10x.

The naive way to do this is to use the set_index method

df2 = df.set_index('timestamp')

However, if you know that your new index column is sorted, then you can make this much faster by passing the sorted=True keyword argument:

df2 = df.set_index('timestamp', sorted=True)

Divisions

In the above case we still pass through the data once to find good breakpoints. However, if your data is already nicely segmented (such as one file per day), then you can give these division values to set_index to avoid this initial pass (which can be costly for a large amount of CSV data):

import pandas as pd

# one division boundary per day, 2000-01-01 through 2001-01-01 inclusive,
# matching the one-file-per-day layout of the data
divisions = tuple(pd.date_range(start='2000', end='2001', freq='1D'))
df2 = df.set_index('timestamp', sorted=True, divisions=divisions)

This solution correctly and cheaply sets the timestamp column as the index (allowing for efficient computations in the future).
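To illustrate the payoff (a sketch; the dates here are arbitrary), a sorted index with known divisions lets Dask touch only the partitions relevant to a time-based selection, and enables time-aware operations like resample:

# label-based slicing reads only the partitions covering this week
week = df2.loc['2000-03-01':'2000-03-07']

# resampling works because the index is a sorted DatetimeIndex with known divisions
daily_mean = df2.resample('1D').mean().compute()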

Convert to another format

CSV is a pervasive and convenient format. However, it is also very slow to parse. Other formats, like Parquet, may be of interest to you; they can easily be 10x to 100x faster.
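For example (a sketch; the paths are placeholders), you can write the indexed dataframe out once and read the Parquet data back in later sessions, avoiding repeated CSV parsing:

# write once; the index is stored along with the data
df2.to_parquet('2000.parquet')

# later sessions read the binary, columnar data directly
df3 = dd.read_parquet('2000.parquet')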

answered Sep 24 '22 by MRocklin