I have a CSV file which isn't coming in correctly with pandas.read_csv when I filter the columns with usecols and use multiple indexes.
    import pandas as pd
    csv = r"""dummy,date,loc,x
    bar,20090101,a,1
    bar,20090102,a,3
    bar,20090103,a,5
    bar,20090101,b,1
    bar,20090102,b,3
    bar,20090103,b,5"""

    f = open('foo.csv', 'w')
    f.write(csv)
    f.close()

    df1 = pd.read_csv('foo.csv',
            header=0,
            names=["dummy", "date", "loc", "x"],
            index_col=["date", "loc"],
            usecols=["dummy", "date", "loc", "x"],
            parse_dates=["date"])
    print df1

    # Ignore the dummy columns
    df2 = pd.read_csv('foo.csv',
            index_col=["date", "loc"],
            usecols=["date", "loc", "x"], # <----------- Changed
            parse_dates=["date"],
            header=0,
            names=["dummy", "date", "loc", "x"])
    print df2
I expect that df1 and df2 should be the same except for the missing dummy column, but the columns come in mislabeled. Also, in df2 the date is not getting parsed as a date.
    In [118]: %run test.py
                   dummy  x
    date       loc
    2009-01-01 a     bar  1
    2009-01-02 a     bar  3
    2009-01-03 a     bar  5
    2009-01-01 b     bar  1
    2009-01-02 b     bar  3
    2009-01-03 b     bar  5
                  date
    date loc
    a    1    20090101
         3    20090102
         5    20090103
    b    1    20090101
         3    20090102
         5    20090103
Using column numbers instead of names gives me the same problem. I can work around the issue by dropping the dummy column after the read_csv step, but I'm trying to understand what is going wrong. I'm using pandas 0.10.1.
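For reference, the drop-after-reading workaround mentioned above might look roughly like this. It is only a sketch and assumes Python 3 and a recent pandas rather than the 0.10.1 used in the question; the variable name csv_text is made up for the example.

```python
from io import StringIO

import pandas as pd

# Same sample data as in the question, held in memory instead of foo.csv.
csv_text = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

# Read every column, taking the names from the header row.
df = pd.read_csv(StringIO(csv_text), parse_dates=["date"])

# Drop the unwanted column afterwards, then build the two-level index.
df = df.drop(columns=["dummy"]).set_index(["date", "loc"])
print(df)
```

This reads the dummy column into memory before discarding it, which is the cost that filtering at read time is meant to avoid.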
edit: fixed bad header usage.
usecols is supposed to provide a filter before reading the whole DataFrame into memory; if used properly, there should never be a need to delete columns after reading.
This can be done with pandas.read_csv(): pass the CSV file as the first argument and the list of columns to keep through the usecols keyword. Only those columns of the CSV file are returned.
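A minimal sketch of that call, assuming Python 3 and a recent pandas; the two-row sample and the names by_name and by_position are invented for illustration. usecols accepts either header names or 0-based column positions.

```python
from io import StringIO

import pandas as pd

csv_text = "dummy,date,loc,x\nbar,20090101,a,1\nbar,20090102,a,3\n"

# Select columns by the names found in the header row;
# "dummy" is filtered out while the file is being read.
by_name = pd.read_csv(StringIO(csv_text), usecols=["date", "loc", "x"])

# The same selection by 0-based position in the file.
by_position = pd.read_csv(StringIO(csv_text), usecols=[1, 2, 3])

print(by_name.columns.tolist())  # ['date', 'loc', 'x']
```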
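The solution lies in understanding these two keyword arguments:
names is only necessary when there is no header row in your file and you want to specify other arguments (such as usecols) using column names rather than integer indices.
usecols selects which columns are read from the file, by header name or by integer position.
So because you have a header row, passing header=0 is sufficient, and additionally passing names appears to be confusing pd.read_csv.
Removing names from the second call gives the desired output: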
    import pandas as pd
    from StringIO import StringIO

    csv = r"""dummy,date,loc,x
    bar,20090101,a,1
    bar,20090102,a,3
    bar,20090103,a,5
    bar,20090101,b,1
    bar,20090102,b,3
    bar,20090103,b,5"""

    df = pd.read_csv(StringIO(csv),
            header=0,
            index_col=["date", "loc"],
            usecols=["date", "loc", "x"],
            parse_dates=["date"])
Which gives us:
                    x
    date       loc
    2009-01-01 a    1
    2009-01-02 a    3
    2009-01-03 a    5
    2009-01-01 b    1
    2009-01-02 b    3
    2009-01-03 b    5
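For contrast, here is a sketch of the situation where names does earn its place: the same data with the header row stripped off. This assumes Python 3 (where StringIO lives in io) and a recent pandas, and has not been checked against 0.10.1; headerless_csv and with_names are illustrative names.

```python
from io import StringIO  # Python 3 location of StringIO

import pandas as pd

# Same rows as above, but with no header line in the data.
headerless_csv = """bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

with_names = pd.read_csv(
    StringIO(headerless_csv),
    header=None,                          # the data has no header row
    names=["dummy", "date", "loc", "x"],  # so supply one name per column
    usecols=["date", "loc", "x"],         # then filter by those names
    index_col=["date", "loc"],
    parse_dates=["date"],
)
print(with_names)
```

Here names supplies the labels that usecols and index_col refer to; when the file already has a header row, those labels come from the file and names can be left out.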