
Old pre-0.17 pandas.read_csv behavior of `header=True` for inferring header row?

How did old pre-0.17 versions of pandas read_csv() interpret passing a boolean header=True/False for inferring the header row?

I have CSV data with header:

col1;col2;col3
1.0;10.0;100.0
2.0;20.0;200.0
3.0;30.0;300.0

If I read it with header=True:

df = pandas.read_csv('test.csv', sep=';', header=True)

I get the following data frame:

   1.0  10.0  100.0
0    2    20    200
1    3    30    300

This means that pandas used the second row ("row 1") for the column names (the inferred names are '1.0', '10.0' and '100.0').

Whereas if I read it with header=False:

df = pandas.read_csv('test.csv', sep=';', header=False)

I get the following:

   col1  col2  col3
0     1    10   100
1     2    20   200
2     3    30   300

This means that pandas used the first row ("row 0") as the header, in spite of the fact that I explicitly said there is no header.

This behaviour is not intuitive to me. Can somebody explain what is happening?

Roman asked Sep 23 '15 10:09


1 Answer

The header argument tells pandas which line is your header line. Because bool is a subclass of int in Python, False evaluates to 0, which is why pandas reads the first line as the header. True evaluates to 1, so it reads the second line as the header and drops the line before it. If you pass None, pandas assumes there is no header row and auto-generates integer column labels.

import io
import pandas as pd
t="""col1;col2;col3
1.0;10.0;100.0
2.0;20.0;200.0
3.0;30.0;300.0"""
print('False:\n', pd.read_csv(io.StringIO(t), sep=';', header=False))
print('\nTrue:\n', pd.read_csv(io.StringIO(t), sep=';', header=True))
print('\nNone:\n', pd.read_csv(io.StringIO(t), sep=';', header=None))

False:
    col1  col2  col3
0     1    10   100
1     2    20   200
2     3    30   300

True:
    1.0  10.0  100.0
0    2    20    200
1    3    30    300

None:
       0     1      2
0  col1  col2   col3
1   1.0  10.0  100.0
2   2.0  20.0  200.0
3   3.0  30.0  300.0
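
The coercion described above can be checked without pandas at all. The following is a minimal sketch in plain Python; header_cells is a hypothetical helper invented for illustration (it is not part of pandas) that picks a header row by position the way an integer-based parser would:

```python
# bool is a subclass of int, so True and False behave as 1 and 0
# wherever an integer row position is expected.
assert issubclass(bool, int)
assert True == 1 and False == 0

lines = ["col1;col2;col3", "1.0;10.0;100.0", "2.0;20.0;200.0"]


def header_cells(rows, header):
    """Hypothetical helper: return the cells of the row at position `header`."""
    return rows[header].split(";")


header_false = header_cells(lines, False)  # same as rows[0]
header_true = header_cells(lines, True)    # same as rows[1]

print(header_false)
print(header_true)
```

Indexing with False selects the real header line, and indexing with True selects the first data line, which matches the column names seen in the two outputs above.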

UPDATE

Since version 0.17.0, passing a boolean for header raises a TypeError.
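
On pandas 0.17.0 or later, the intent has to be spelled out with an integer or None. The sketch below assumes such a version is installed and reuses the sample data from above; the variable names are only illustrative:

```python
import io

import pandas as pd

t = """col1;col2;col3
1.0;10.0;100.0
2.0;20.0;200.0
3.0;30.0;300.0"""

# header=0: the first line holds the column names.
first_row_header = pd.read_csv(io.StringIO(t), sep=';', header=0)

# header=None: no header line, so pandas generates integer column labels.
no_header = pd.read_csv(io.StringIO(t), sep=';', header=None)

# A boolean is rejected instead of being silently treated as 0 or 1.
try:
    pd.read_csv(io.StringIO(t), sep=';', header=True)
    bool_rejected = False
except TypeError:
    bool_rejected = True

print(first_row_header.columns.tolist())
print(no_header.columns.tolist())
print(bool_rejected)
```

Use header=0 when the file has a header line and header=None when it does not; the latter keeps the first line as data.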

EdChum answered Oct 06 '22 01:10