
Can I set the index column when reading a CSV using Python dask?

When using Python Pandas to read a CSV it is possible to specify the index column. Is this possible using Python Dask when reading the file, as opposed to setting the index afterwards?

For example, using pandas:

df = pandas.read_csv(filename, index_col=0)

Ideally, with Dask, this would be:

df = dask.dataframe.read_csv(filename, index_col=0)

I have tried

df = dask.dataframe.read_csv(filename).set_index(?)

but the index column does not have a name (and this seems slow).

asked Sep 12 '17 10:09 by Jaydog



3 Answers

No, reading the file and setting the index need to be two separate calls. If you try to pass an index keyword to read_csv, Dask tells you so in a clear error message.

In [1]: import dask.dataframe as dd
In [2]: df = dd.read_csv('*.csv', index='my-index')
ValueError: Keyword 'index' not supported dd.read_csv(...).set_index('my-index') instead

But this won't be any slower or faster than doing it the other way.

answered Oct 12 '22 13:10 by MRocklin


I know I'm a bit late, but this is the first result on Google, so it should get an answer.

If you write your dataframe with:

# index=True is the default
my_pandas_df.to_csv('path')

# so this is equivalent
my_pandas_df.to_csv('path', index=True)

And import with Dask:

import dask.dataframe as dd
my_dask_df = dd.read_csv('path').set_index('Unnamed: 0')

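This uses column 0 as your index. The column is called 'Unnamed: 0' because pandas.DataFrame.to_csv() writes an unnamed index with an empty header, which read_csv then labels 'Unnamed: 0'.

To find the column name yourself, inspect the columns: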

my_dask_df = dd.read_csv('path')
my_dask_df.columns

which returns

Index(['Unnamed: 0', 'col 0', 'col 1',
       ...
       'col n'],
      dtype='object', length=...)
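
If you control how the CSV is written, one way to avoid the 'Unnamed: 0' column is to give the index a header at write time. The sketch below uses the index_label parameter of pandas.DataFrame.to_csv; the file name, the column names and the label "row_id" are hypothetical choices for illustration.

```python
import os
import tempfile

import pandas as pd

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "named_index.csv")

df = pd.DataFrame({"col 0": [1.0, 2.0], "col 1": [3.0, 4.0]})

# index_label writes a header for the index column,
# so reading the file back yields no 'Unnamed: 0' column.
df.to_csv(path, index_label="row_id")

columns = list(pd.read_csv(path).columns)
print(columns)  # ['row_id', 'col 0', 'col 1']
```

With a file written this way, the Dask side would presumably become dd.read_csv(path).set_index('row_id').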

answered Oct 12 '22 12:10 by E. Bassett


Now you can write: df = pandas.read_csv(filename, index_col='column_name'), where column_name is the name of the column you want to set as the index.

answered Oct 12 '22 12:10 by Sunil