
pandas retaining index column when using usecols

Tags: python, pandas

This is a re-worded version of my question which hopefully makes more sense:

When using read_csv with an implicit index (i.e. the first column in the file does not have a header), everything works and I get a dataframe whose index is the first column in the file - the implicit index column.

However, if I specify usecols as an argument to read_csv, the implicit index column is ignored and the returned dataframe has a standard index created by pandas (0, 1, 2, 3 etc).

I cannot explicitly pass the index column in the list for usecols and then specify the index_col argument because the implicit index column has no header (this is how pandas knows it is an implicit index)!

Is there any way around this?

Here is the original question:

I am trying to read a csv file which has an unnamed column of row indexes; the rest of the columns are named:

       |head1|head2|
index1 | data1 | data2 |

When I read in a certain number of columns with usecols, I also want to include the row indexes. However, as these are not named, I can't include the string in my list for usecols.

I've tried a combination of an integer index and strings (e.g. usecols = [0, 'header1', 'header2']), but this does not seem to work.
If I simply specify index_col as 0, it will use the first column in my selection as the index column.

So, how can I read in a named column selection (via usecols) whilst retaining the first, nameless, column in the file as my row index?

jramm asked Sep 11 '13 11:09


3 Answers

I recently had this same issue and was able to solve it by using the default name that pandas gives to an unnamed column, 'Unnamed: 0'.

data = pd.read_csv('advertising.csv', header=0, index_col=[0], usecols=['Unnamed: 0', 'radio', 'sales'])
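
For anyone who wants to try this without the original file, here is a minimal, self-contained sketch. The inline data is made up for illustration: it only assumes a file whose first header field is blank, followed by TV, radio and sales columns. It relies on pandas naming a blank header field 'Unnamed: 0', which can then be listed in usecols like any other column.

```python
import io
import pandas as pd

# Made-up stand-in for advertising.csv: the first header field is blank,
# so pandas labels that column 'Unnamed: 0'.
raw = (
    ",TV,radio,sales\n"
    "1,230.1,37.8,22.1\n"
    "2,44.5,39.3,10.4\n"
)

# Select the unnamed column together with the named ones, then make it the index.
data = pd.read_csv(
    io.StringIO(raw),
    header=0,
    index_col=0,
    usecols=["Unnamed: 0", "radio", "sales"],
)
print(data)
```

The TV column is dropped by usecols, while the first column of the file is kept as the row index.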
Jeff W answered Oct 23 '22 13:10


Try without usecols: there is a known bug which means it won't work with a separator other than a comma.

You can read these directly:

In [11]: pd.read_csv('foo.csv', sep=r'\s*\|\s*', index_col=[0])
Out[11]: 
        head1  head2  Unnamed: 3
index1  data1  data2         NaN

In [12]: pd.read_csv('foo.csv', sep=r'\s*\|\s*', index_col=[0]).dropna(axis=1)
Out[12]: 
        head1  head2
index1  data1  data2

Note: I've had to use \s*\|\s* as the sep rather than just | so that the surrounding spaces are not included in the values.
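
A self-contained variant of the same idea, as a sketch: the sample text from the question is fed in through io.StringIO instead of foo.csv. It assumes engine='python' (regular-expression separators need the Python parsing engine) and uses dropna(axis=1, how='all') so that only the empty column produced by the trailing | is removed.

```python
import io
import pandas as pd

# The two sample lines from the question, including the trailing '|'.
raw = "       |head1|head2|\nindex1 | data1 | data2 |\n"

# The regex separator swallows the spaces around each '|'.
df = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s*\|\s*",
    engine="python",
    index_col=0,
)

# Drop the all-NaN column created by the trailing separator.
df = df.dropna(axis=1, how="all")
print(df)
```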

Andy Hayden answered Oct 23 '22 12:10


If I understand this question correctly, I think you may have to read in the entire csv file as a dataframe and then select the columns that you want. Something like this:

import pandas as pd
df = pd.read_csv(yourdata, index_col=0).loc[:,'header1']
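
To make that concrete, here is a small runnable sketch with made-up inline data. It assumes the implicit-index layout described in the question, where the header row has one field fewer than the data rows, so index_col is not passed and pandas picks up the first column as the index on its own.

```python
import io
import pandas as pd

# Made-up data: the header row has three names but each data row has four
# fields, so pandas treats the first field of every row as an implicit index.
raw = (
    "header1,header2,header3\n"
    "index1,1,2,3\n"
    "index2,4,5,6\n"
)

# Read the whole file first (no usecols), letting pandas pick up the index.
df = pd.read_csv(io.StringIO(raw))

# Then keep only the columns of interest; the index is retained.
subset = df[["header1", "header2"]]
print(subset)
```

The cost is that every column is parsed before the unwanted ones are discarded, which may matter for very wide files.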
John answered Oct 23 '22 11:10