 

Set the headers using pandas.read_csv

Tags:

python

pandas

I have a CSV file that I read into a DataFrame with pandas. I want to supply my own header instead of using the file's first row, and I also drop some of the rows. What is the best way to do this?

I tried the following but this didn't work as expected:

import pandas

header_row = ['col1', 'col2', 'col3', 'col4', 'col1', 'col2']  # note the header has duplicate column values
df = pandas.read_csv(csv_file, skiprows=[0, 1, 2, 3, 4, 5], names=header_row)

This gives the following error:

File "third_party/py/pandas/io/parsers.py", line 187, in read_csv
File "third_party/py/pandas/io/parsers.py", line 160, in _read
File "third_party/py/pandas/io/parsers.py", line 628, in get_chunk
File "third_party/py/pandas/core/frame.py", line 302, in __init__
File "third_party/py/pandas/core/frame.py", line 388, in _init_dict
File "third_party/py/pandas/core/internals.py", line 1008, in form_blocks
File "third_party/py/pandas/core/internals.py", line 1036, in _simple_blockify
File "third_party/py/pandas/core/internals.py", line 1068, in _stack_dict
IndexError: index out of bounds

I then tried setting the columns via

df.columns = header_row

But this failed as well, probably because of the duplicate column values:

File "engines.pyx", line 101, in pandas._engines.DictIndexEngine.get_loc (third_party/py/pandas/src/engines.c:2498)
File "engines.pyx", line 107, in pandas._engines.DictIndexEngine.get_loc (third_party/py/pandas/src/engines.c:2447)
Exception: ('Index values are not unique', 'occurred at index entity')

I am using pandas version 0.7.3. From the documentation:

names : array-like
    List of column names

I am sure I am missing something simple. Thanks for any help.

Manju asked Aug 22 '12 05:08



1 Answer

Pandas 0.7.3 does not support duplicates in the index. You need at least 0.8.0. Several issues with duplicates in the index were fixed between 0.8.0 and 0.8.1, so 0.8.1 (the most recent stable release) is probably the best choice. However, even 0.8.1 does not fully solve your problem, because that version still has an issue with duplicate column names: you cannot display a dataframe that has them.
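
Since duplicate column names are the sticking point, one possible workaround is to make the names unique before handing them to read_csv. The following is only a sketch, not something taken from the pandas documentation: make_unique is a hypothetical helper, the ".1" suffix scheme is an arbitrary choice, and the in-memory sample stands in for the real csv_file. It also assumes a pandas version recent enough to accept an integer skiprows together with header=None, and a Python version with f-strings.

```python
import io
from collections import Counter

import pandas as pd


def make_unique(names):
    """Hypothetical helper: suffix repeated names with .1, .2, ... so all labels differ."""
    seen = Counter()
    unique = []
    for name in names:
        count = seen[name]
        # The first occurrence keeps its name; later ones get a numeric suffix.
        unique.append(name if count == 0 else f"{name}.{count}")
        seen[name] += 1
    return unique


header_row = ["col1", "col2", "col3", "col4", "col1", "col2"]

# Stand-in for csv_file (an assumption): six lines to discard, then two data rows.
sample = io.StringIO(
    "a,b,c,d,e,f\n"
    "junk\njunk\njunk\njunk\njunk\n"
    "1,2,3,4,5,6\n"
    "7,8,9,10,11,12\n"
)

# Skip the first six lines and label the remaining columns with the unique names.
df = pd.read_csv(sample, skiprows=6, header=None, names=make_unique(header_row))
print(df.columns.tolist())
```

The helper does not guard against a suffixed name colliding with a label that is already in the list (for example an existing "col1.1"), so treat it as a starting point only.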

Wouter Overmeire answered Sep 28 '22 02:09
