pandas read_csv and filter columns with usecols

I have a CSV file that pandas.read_csv reads incorrectly when I filter the columns with usecols and use a multi-level index.

    import pandas as pd

    csv = r"""dummy,date,loc,x
       bar,20090101,a,1
       bar,20090102,a,3
       bar,20090103,a,5
       bar,20090101,b,1
       bar,20090102,b,3
       bar,20090103,b,5"""

    # Write the sample data to disk so read_csv can load it
    with open('foo.csv', 'w') as f:
        f.write(csv)

    df1 = pd.read_csv('foo.csv',
            header=0,
            names=["dummy", "date", "loc", "x"],
            index_col=["date", "loc"],
            usecols=["dummy", "date", "loc", "x"],
            parse_dates=["date"])
    print(df1)

    # Ignore the dummy column
    df2 = pd.read_csv('foo.csv',
            index_col=["date", "loc"],
            usecols=["date", "loc", "x"],  # <----------- Changed
            parse_dates=["date"],
            header=0,
            names=["dummy", "date", "loc", "x"])
    print(df2)

I expect df1 and df2 to be the same except for the missing dummy column, but in df2 the columns come in mislabeled. Also, the date in df2 is not getting parsed as a date.

    In [118]: %run test.py
                   dummy  x
    date       loc
    2009-01-01 a     bar  1
    2009-01-02 a     bar  3
    2009-01-03 a     bar  5
    2009-01-01 b     bar  1
    2009-01-02 b     bar  3
    2009-01-03 b     bar  5
                  date
    date loc
    a    1    20090101
         3    20090102
         5    20090103
    b    1    20090101
         3    20090102
         5    20090103

Using column numbers instead of names gives me the same problem. I can work around the issue by dropping the dummy column after the read_csv step, but I'm trying to understand what is going wrong. I'm using pandas 0.10.1.
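For reference, the workaround mentioned above might look roughly like the following. This is only a sketch: it assumes a recent pandas release (the `columns=` keyword of `drop` does not exist in 0.10.1), and the `CSV_TEXT` name and the in-memory buffer are stand-ins for the file on disk.

```python
import io

import pandas as pd

# Stand-in for foo.csv (assumption: same layout, no leading spaces)
CSV_TEXT = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

# Read every column first, without any usecols filter
df = pd.read_csv(io.StringIO(CSV_TEXT), header=0)

# Convert the integer-looking dates explicitly
df["date"] = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")

# Drop the unwanted column after the fact, then build the two-level index
df = df.drop(columns=["dummy"]).set_index(["date", "loc"])
print(df)
```

This gets the right frame, but it reads the dummy column into memory only to throw it away, which is what usecols is meant to avoid.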

edit: fixed bad header usage.

chip asked Feb 22 '13 04:02



1 Answer

The solution lies in understanding these two keyword arguments:

  • names is only necessary when there is no header row in your file and you want to specify other arguments (such as usecols) using column names rather than integer indices.
  • usecols is supposed to provide a filter before reading the whole DataFrame into memory; if used properly, there should never be a need to delete columns after reading.
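To illustrate the first point, here is a minimal sketch of the case where names is actually needed: the same data with the header row removed. The headerless text is an assumption made up for this example, and the code targets a recent pandas release rather than 0.10.1.

```python
import io

import pandas as pd

# Hypothetical variant of the question's data with no header row
HEADERLESS = """bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

df = pd.read_csv(
    io.StringIO(HEADERLESS),
    header=None,                          # the file has no header row
    names=["dummy", "date", "loc", "x"],  # so the labels are supplied here
    usecols=["date", "loc", "x"],         # filter by those labels
    index_col=["date", "loc"],
    parse_dates=["date"],
)
print(df)
```

Here names gives usecols, index_col and parse_dates something to refer to; without it, the columns could only be addressed by integer position.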

Because your file has a header row, passing header=0 is sufficient; additionally passing names appears to confuse pd.read_csv.

Removing names from the second call gives the desired output:

    import pandas as pd
    from io import StringIO  # on Python 2: from StringIO import StringIO

    csv = r"""dummy,date,loc,x
    bar,20090101,a,1
    bar,20090102,a,3
    bar,20090103,a,5
    bar,20090101,b,1
    bar,20090102,b,3
    bar,20090103,b,5"""

    # header=0 takes the labels from the first row; names is not passed
    df = pd.read_csv(StringIO(csv),
            header=0,
            index_col=["date", "loc"],
            usecols=["date", "loc", "x"],
            parse_dates=["date"])

Which gives us:

                    x
    date       loc
    2009-01-01 a    1
    2009-01-02 a    3
    2009-01-03 a    5
    2009-01-01 b    1
    2009-01-02 b    3
    2009-01-03 b    5
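Since the question also mentions column numbers, here is a sketch of the same filter written with integer positions, again leaving names out. It assumes a recent pandas release; the variable names are only for illustration.

```python
import io

import pandas as pd

CSV_TEXT = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

# Filter by label, as in the answer above
df_by_name = pd.read_csv(
    io.StringIO(CSV_TEXT),
    header=0,
    index_col=["date", "loc"],
    usecols=["date", "loc", "x"],
    parse_dates=["date"],
)

# Filter by position: columns 1, 2 and 3 are date, loc and x
df_by_position = pd.read_csv(
    io.StringIO(CSV_TEXT),
    header=0,
    index_col=["date", "loc"],
    usecols=[1, 2, 3],
    parse_dates=["date"],
)

# Both calls should select the same columns and build the same index
print(df_by_position.equals(df_by_name))
```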
Mack answered Oct 14 '22 21:10