pandas read_csv and filter columns with usecols

I have a csv file which isn't coming in correctly with pandas.read_csv when I filter the columns with usecols and use multiple indexes.

    import pandas as pd

    csv = r"""dummy,date,loc,x
    bar,20090101,a,1
    bar,20090102,a,3
    bar,20090103,a,5
    bar,20090101,b,1
    bar,20090102,b,3
    bar,20090103,b,5"""

    # Write the sample data to disk so read_csv can load it by filename
    with open('foo.csv', 'w') as f:
        f.write(csv)

    df1 = pd.read_csv('foo.csv',
            header=0,
            names=["dummy", "date", "loc", "x"],
            index_col=["date", "loc"],
            usecols=["dummy", "date", "loc", "x"],
            parse_dates=["date"])
    print(df1)

    # Ignore the dummy columns
    df2 = pd.read_csv('foo.csv',
            index_col=["date", "loc"],
            usecols=["date", "loc", "x"], # <----------- Changed
            parse_dates=["date"],
            header=0,
            names=["dummy", "date", "loc", "x"])
    print(df2)

I expect df1 and df2 to be the same except for the missing dummy column, but in df2 the columns come in mislabeled. Also, the date in df2 is not getting parsed as a date.

    In [118]: %run test.py
                   dummy  x
    date       loc
    2009-01-01 a     bar  1
    2009-01-02 a     bar  3
    2009-01-03 a     bar  5
    2009-01-01 b     bar  1
    2009-01-02 b     bar  3
    2009-01-03 b     bar  5
                  date
    date loc
    a    1    20090101
         3    20090102
         5    20090103
    b    1    20090101
         3    20090102
         5    20090103

Using column numbers instead of names gives me the same problem. I can work around the issue by dropping the dummy column after the read_csv step, but I'm trying to understand what is going wrong. I'm using pandas 0.10.1.

edit: fixed bad header usage.



The solution lies in understanding these two keyword arguments:

- names is only necessary when there is no header row in your file and you want to specify other arguments (such as usecols) using column names rather than integer indices.
- usecols is supposed to provide a filter before reading the whole DataFrame into memory; if used properly, there should never be a need to delete columns after reading.

Because you have a header row, passing header=0 is sufficient; additionally passing names appears to confuse pd.read_csv.
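For contrast, here is a minimal sketch of the one case where names is actually needed. It assumes a hypothetical headerless variant of the file (the `raw` string below is made up for illustration and is not the asker's data), and it assumes a reasonably recent pandas on Python 3, not 0.10.1. With no header row to read, names supplies the labels that usecols, index_col and parse_dates then refer to.

```python
import pandas as pd
from io import StringIO

# Hypothetical headerless data: same layout as the question's file,
# but without the "dummy,date,loc,x" row.
raw = """bar,20090101,a,1
bar,20090102,a,3
bar,20090103,b,5"""

df = pd.read_csv(
    StringIO(raw),
    header=None,                          # the file has no header row
    names=["dummy", "date", "loc", "x"],  # so the labels are supplied here
    usecols=["date", "loc", "x"],         # filter by those labels
    index_col=["date", "loc"],
    parse_dates=["date"],
)
print(df)
```

When the file already has a header row, as in the question, the names argument can simply be left out.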

Removing names from the second call gives the desired output:

    import pandas as pd
    from io import StringIO  # on Python 2: from StringIO import StringIO

    csv = r"""dummy,date,loc,x
    bar,20090101,a,1
    bar,20090102,a,3
    bar,20090103,a,5
    bar,20090101,b,1
    bar,20090102,b,3
    bar,20090103,b,5"""

    # header=0 takes the column names from the first row, so names is not needed
    df = pd.read_csv(StringIO(csv),
            header=0,
            index_col=["date", "loc"],
            usecols=["date", "loc", "x"],
            parse_dates=["date"])

Which gives us:

                    x
    date       loc
    2009-01-01 a    1
    2009-01-02 a    3
    2009-01-03 a    5
    2009-01-01 b    1
    2009-01-02 b    3
    2009-01-03 b    5
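As a rough check that filtering with usecols really does replace the drop-the-column workaround mentioned in the question, the sketch below reads the same data three ways and compares the results. It assumes a recent pandas on Python 3; the variable names (`common`, `filtered`, `dropped`, `by_callable`) are only for illustration. The callable form of usecols is a feature of newer pandas releases and is not available in 0.10.1.

```python
import pandas as pd
from io import StringIO

csv = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

# Arguments shared by all three reads.
common = dict(header=0, index_col=["date", "loc"], parse_dates=["date"])

# Filter while parsing: only the listed columns are read.
filtered = pd.read_csv(StringIO(csv), usecols=["date", "loc", "x"], **common)

# Workaround from the question: read everything, then drop the unwanted column.
dropped = pd.read_csv(StringIO(csv), **common).drop(columns=["dummy"])

# usecols also accepts a callable that is applied to each column name.
by_callable = pd.read_csv(
    StringIO(csv), usecols=lambda name: name != "dummy", **common
)

print(filtered.equals(dropped))
print(filtered.equals(by_callable))
```

If both comparisons hold, the post-read drop is redundant and usecols alone is enough.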
