pandas read_csv index_col=None not working with delimiters at the end of each line

Tags: python, pandas

I am going through the 'Python for Data Analysis' book and having trouble in the 'Example: 2012 Federal Election Commission Database' section reading the data into a DataFrame. The trouble is that one of the columns of data is always being set as the index column, even when the index_col argument is set to None.

Here is the link to the data: http://www.fec.gov/disclosurep/PDownload.do

Here is the loading code (to save time while checking, I set nrows=10):

import pandas as pd
fec = pd.read_csv('P00000001-ALL.csv', nrows=10, index_col=None)

To keep it short I am excluding the data column outputs, but here is my output (please note the Index values):

In [20]: fec

Out[20]:
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, C00410118 to C00410118
Data columns:
...
dtypes: float64(4), int64(3), object(11)

And here is the book's output (again with data columns excluded):

In [13]: fec = read_csv('P00000001-ALL.csv')
In [14]: fec
Out[14]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1001731 entries, 0 to 1001730
...
dtypes: float64(1), int64(1), object(14)

The Index values in my output are actually the first column of data in the file, which shifts all the rest of the data to the left by one column. Would anyone know how to prevent this column of data from being used as the index? I would like the index to be plain integers increasing by 1.

I am fairly new to Python and pandas, so I apologize for any inconvenience. Thanks.

asked Oct 18 '12 17:10

Rich

3 Answers

Quick Answer

Use index_col=False instead of index_col=None when the file has a delimiter at the end of each line. This turns off index column inference and discards the last (empty) column.

More Detail

Looking at the data, there is a comma at the end of each line. This quote from the documentation (which has been edited since this post was created):

index_col: column number, column name, or list of column numbers/names, to use as the index (row labels) of the resulting DataFrame. By default, it will number the rows without using any column, unless there is one more data column than there are headers, in which case the first column is taken as the index.

shows that pandas believes you have n headers and n+1 data columns, and so it treats the first column as the index.
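
To see that inference in isolation, here is a minimal sketch using a tiny in-memory CSV rather than the real FEC file. The column names and values are made up for illustration (only the C00410118 ID is borrowed from the question's output); what matters is that the header has three names while each data line ends with a comma, giving four fields.

```python
import io

import pandas as pd

# Hypothetical sample: 3 header names, but every data line ends with a comma,
# so the parser sees 4 fields per row.
csv_text = (
    "cmte_id,name,amount\n"
    "C00410118,Smith,250.0,\n"
    "C00410118,Jones,100.0,\n"
)

shifted = pd.read_csv(io.StringIO(csv_text), index_col=None)

# The first field of each row has been taken as the index, and the remaining
# fields have shifted left by one under the header names.
print(shifted)
print(list(shifted.index))
```

In this sketch the "amount" column ends up empty, because the only field left for it is the blank one after the trailing comma.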

EDIT 10/20/2014 - More information

I found another valuable entry in the documentation that is specifically about trailing delimiters and how to simply ignore them:

If a file has one more column of data than the number of column names, the first column will be used as the DataFrame’s row names: ...

Ordinarily, you can achieve this behavior using the index_col option.

There are some exception cases when a file has been prepared with delimiters at the end of each data line, confusing the parser. To explicitly disable the index column inference and discard the last column, pass index_col=False: ...
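
As a rough illustration of that last point, the sketch below reads a small made-up CSV whose data lines each end with a trailing comma. The column names and values are hypothetical stand-ins, not the real FEC columns; only the index_col=False argument comes from the documentation quoted above.

```python
import io

import pandas as pd

# Hypothetical sample with a trailing comma on every data line.
csv_text = (
    "cmte_id,name,amount\n"
    "C00410118,Smith,250.0,\n"
    "C00410118,Jones,100.0,\n"
)

# index_col=False disables index inference; the empty trailing field is dropped
# and the rows get a default integer index.
fixed = pd.read_csv(io.StringIO(csv_text), index_col=False)

print(fixed)
print(list(fixed.index))
```

With this call each value should line up under its own header again, and the index should simply count the rows from 0.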

answered Oct 21 '22 23:10

craigts

Re: craigts's response: for anyone having trouble with passing either False or None to index_col, such as when you're trying to get rid of a range index, you can instead pass an integer to specify the column you want to use as the index. For example:

df = pd.read_csv('file.csv', index_col=0)

The above will set the first column as the index (and not add a range index in my "common case").

Update

Given the popularity of this answer, I thought I'd add some context and a demo:

# Setting up the dummy data
In [1]: df = pd.DataFrame({"A":[1, 2, 3], "B":[4, 5, 6]})

In [2]: df
Out[2]:
   A  B
0  1  4
1  2  5
2  3  6

In [3]: df.to_csv('file.csv', index=False)
File[3]:
A,B
1,4
2,5
3,6

Reading without index_col, or with index_col=None or index_col=False, results in a range index in every case:

In [4]: pd.read_csv('file.csv')
Out[4]:
   A  B
0  1  4
1  2  5
2  3  6

# Note that this is the default behavior, so the same as In [4]
In [5]: pd.read_csv('file.csv', index_col=None)
Out[5]:
   A  B
0  1  4
1  2  5
2  3  6

In [6]: pd.read_csv('file.csv', index_col=False)
Out[6]:
   A  B
0  1  4
1  2  5
2  3  6

However, if we specify that "A" (the 0th column) is actually the index, we can avoid the range index:

In [7]: pd.read_csv('file.csv', index_col=0)
Out[7]:
   B
A
1  4
2  5
3  6

answered Oct 22 '22 00:10

ZaxR

If pandas is treating your first row as a header, you can pass header=None like this:

df = pd.read_csv("csv-file.csv", header=None)

This way pandas will treat your first row like any other row.
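
A small sketch of what header=None does, using an in-memory CSV with made-up contents instead of a file on disk:

```python
import io

import pandas as pd

# Hypothetical two-column file whose first line would normally be read as the header.
csv_text = "A,B\n1,4\n2,5\n"

no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# The columns are numbered 0, 1 and the line "A,B" becomes an ordinary data row.
print(no_header)
print(no_header.shape)
```

Note that this only changes how the first row is handled; it does not address a first column being picked up as the index.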

answered Oct 21 '22 22:10

Nadeem Zeaiter
