
Read Directory of Timeseries CSV data efficiently with Dask DataFrame and Pandas

I have a directory of timeseries data stored as CSV files, one file per day. How do I load and process it efficiently with Dask DataFrame?

Disclaimer: I maintain Dask. This question occurs often enough in other channels that I decided to add a question here on StackOverflow to which I can point people in the future.

asked Dec 24 '16 by MRocklin


1 Answer

Simple Solution

If you just want to get something working quickly, then simple use of dask.dataframe.read_csv with a glob string for the path should suffice:

import dask.dataframe as dd
df = dd.read_csv('2000-*.csv')
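Note that the read is lazy: dd.read_csv only samples the beginning of the data to infer column names and dtypes, and nothing is actually loaded until you ask for a result. A minimal sketch of that (the name and value columns here are hypothetical):

# operations build a task graph across all matching files
totals = df.groupby('name').value.sum()

# compute() triggers reading the CSVs and running the aggregation in parallel
print(totals.compute())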

Keyword arguments

The dask.dataframe.read_csv function supports most of the pandas.read_csv keyword arguments, so you can tweak the call to match your data, for example by parsing dates:

df = dd.read_csv('2000-*.csv', parse_dates=['timestamp'])
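A slightly fuller sketch, in case it helps (the value column and the particular dtype and blocksize values are illustrative assumptions, not requirements):

df = dd.read_csv(
    '2000-*.csv',
    parse_dates=['timestamp'],    # parse timestamps as datetimes while reading
    dtype={'value': 'float64'},   # pin dtypes up front to avoid inference surprises
    blocksize='64MB',             # Dask-specific: target size of each partition
)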

Set the index

Many operations, like groupbys, joins, and index lookups, can be more efficient if the target column is the index. For example, if the timestamp column is the index then you can quickly look up the values for a particular time range, or you can join efficiently with another dataframe along time. The savings here can easily be 10x.

The naive way to do this is to use the set_index method

df2 = df.set_index('timestamp')

However, if you know that your new index column is sorted, then you can make this much faster by passing the sorted=True keyword argument:

df2 = df.set_index('timestamp', sorted=True)

Divisions

In the above case we still pass through the data once to find good breakpoints. However, if your data is already nicely segmented (such as one file per day), then you can give these division values to set_index to avoid this initial pass (which can be costly for a large amount of CSV data):

import pandas as pd

# one division boundary per day, 2000-01-01 through 2001-01-01 inclusive,
# matching the one-file-per-day layout of the data
divisions = tuple(pd.date_range(start='2000', end='2001', freq='1D'))
df2 = df.set_index('timestamp', sorted=True, divisions=divisions)

This solution correctly and cheaply sets the timestamp column as the index (allowing for efficient computations in the future).
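To illustrate the payoff (a sketch; the dates here are arbitrary), a sorted index with known divisions lets Dask touch only the partitions relevant to a time-based selection, and enables time-aware operations like resample:

# label-based slicing reads only the partitions covering this week
week = df2.loc['2000-03-01':'2000-03-07']

# resampling works because the index is a sorted DatetimeIndex with known divisions
daily_mean = df2.resample('1D').mean().compute()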

Convert to another format

CSV is a pervasive and convenient format. However, it is also very slow to parse. Other formats, like Parquet, may be of interest to you; they can easily be 10x to 100x faster.
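For example (a sketch; the paths are placeholders), you can write the indexed dataframe out once and read the Parquet data back in later sessions, avoiding repeated CSV parsing:

# write once; the index is stored along with the data
df2.to_parquet('2000.parquet')

# later sessions read the binary, columnar data directly
df3 = dd.read_parquet('2000.parquet')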

answered Sep 24 '22 by MRocklin