Can I set the index column when reading a CSV using Python dask?

Tags:

python

dataframe

csv

dask

When using Python Pandas to read a CSV it is possible to specify the index column. Is this possible using Python Dask when reading the file, as opposed to setting the index afterwards?

For example, using pandas:

df = pandas.read_csv(filename, index_col=0)

Ideally, with Dask this could be:

df = dask.dataframe.read_csv(filename, index_col=0)

I have tried

df = dask.dataframe.read_csv(filename).set_index(?)

but the index column does not have a name (and this seems slow).

890

asked Sep 12 '17 10:09

Jaydog

3 Answers

No, reading the file and setting the index need to be two separate steps. If you try to pass an index to read_csv, Dask tells you so in a nice error message.

In [1]: import dask.dataframe as dd
In [2]: df = dd.read_csv('*.csv', index='my-index')
ValueError: Keyword 'index' not supported dd.read_csv(...).set_index('my-index') instead

But setting the index as a separate step won't be any slower or faster than it would be if read_csv did it for you.

113

answered Oct 12 '22 13:10

MRocklin

I know I'm a bit late, but this is the first result on Google, so it should get answered.

If you write your dataframe with:

# index=True is the default
my_pandas_df.to_csv('path')

# so this is the same
my_pandas_df.to_csv('path', index=True)

And import with Dask:

import dask.dataframe as dd
my_dask_df = dd.read_csv('path').set_index('Unnamed: 0')

This uses column 0 as your index. That column shows up as 'Unnamed: 0' because pandas.DataFrame.to_csv() writes the index with an empty header by default.

To find out what the column is called, inspect the columns:

my_dask_df = dd.read_csv('path')
my_dask_df.columns

which returns

Index(['Unnamed: 0', 'col 0', 'col 1',
       ...
       'col n'],
      dtype='object', length=...)

answered Oct 12 '22 12:10

E. Bassett

Now you can write: df = pandas.read_csv(filename, index_col='column_name') (where column_name is the name of the column you want to set as the index).

answered Oct 12 '22 12:10

Sunil
