
Indexing Pandas data frames: integer rows, named columns

Say df is a pandas dataframe.

  • df.loc[] only accepts names
  • df.iloc[] only accepts integers (actual placements)
  • df.ix[] accepts both names and integers:

When referencing rows, df.ix[row_idx, ] only accepts names. For example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a' : ['one', 'two', 'three','four', 'five', 'six'],
                   '1' : np.arange(6)})
df = df.ix[2:6]
print(df)

   1      a
2  2  three
3  3   four
4  4   five
5  5    six

df.ix[0, 'a']

throws an error; it doesn't return 'three' (the value in the first remaining row).

When referencing columns, ix prefers integers, not names. For example:

df.ix[2, 1] 

returns 'three', not 2 (although df.ix[2, '1'] does return 2).
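For readers on recent pandas: .ix was deprecated in pandas 0.20 and removed in 1.0, so the snippets above no longer run there. The following is a minimal sketch of the same setup using only .loc and .iloc. The column order is written out explicitly to match the printed frame, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same data as above; '1' is listed first to match the printed column order.
df = pd.DataFrame({'1': np.arange(6),
                   'a': ['one', 'two', 'three', 'four', 'five', 'six']})
df = df.loc[2:5]               # label-based slice, end label is inclusive

by_label = df.loc[2, 'a']      # row label 2, column label 'a'
by_position = df.iloc[0, 1]    # first remaining row, second column

# Row label 0 was sliced away, so a label lookup for it fails.
try:
    df.loc[0, 'a']
    label_zero_found = True
except KeyError:
    label_zero_found = False
```

Both by_label and by_position refer to the same cell, which is exactly the mismatch between row labels and row positions described here.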

Oddly, I'd like the exact opposite functionality. Usually my column names are very meaningful, so in my code I reference them directly. But due to a lot of observation cleaning, the row names in my pandas data frames don't usually correspond to range(len(df)).

I realize I can use:

df.iloc[0].loc['a'] # returns three 

But it seems ugly! Does anyone know of a better way to do this, so that the code would look like this?

df.foo[0, 'a'] # returns three 

In fact, is it possible to add my own method to pandas.core.frame.DataFrame, so that e.g. df.idx(rows, cols) is in fact df.iloc[rows].loc[cols]?

Hillary Sanders asked Feb 26 '15 23:02



2 Answers

It's a late answer, but @unutbu's comment is still valid and a great solution to this problem.

To index a DataFrame with integer rows and named columns (labeled columns):

df.loc[df.index[#], 'NAME'], where # is a valid integer position and NAME is the name of the column.
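As a small worked example of this pattern, assuming a frame shaped like the one in the question (the explicit index and the variable names are made up for illustration):

```python
import pandas as pd

# Row labels start at 2, so labels and positions disagree.
df = pd.DataFrame({'a': ['three', 'four', 'five', 'six'],
                   '1': [2, 3, 4, 5]},
                  index=[2, 3, 4, 5])

row_label = df.index[0]              # label of the row at position 0
value = df.loc[row_label, 'a']       # ordinary label-based lookup

# Several positions at once: df.index accepts a list of integer positions.
several = df.loc[df.index[[0, 2]], 'a']
```

Here df.index[0] translates the position into the label that .loc expects, so the lookup itself stays purely label-based.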

brunston answered Sep 19 '22 13:09


The existing answers seem short-sighted to me.

Problematic Solutions

  1. df.loc[df.index[0], 'a']
    The strategy here is to get the row label of the 0th row and then use .loc as normal. I see two issues.

    1. If df has repeated row labels, df.loc[df.index[0], 'a'] could return multiple rows.
    2. .loc is slower than .iloc so you're sacrificing speed here.
  2. df.reset_index(drop=True).loc[0, 'a']
    The strategy here is to reset the index so the row labels become 0, 1, 2, ... thus .loc[0] gives the same result as .iloc[0]. Still, the problem here is runtime, as .loc is slower than .iloc and you'll incur a cost for resetting the index.
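The repeated-label issue in point 1 can be seen in a short sketch. The frame below is hypothetical and chosen only because its first two rows share a label:

```python
import pandas as pd

# Hypothetical frame with a repeated row label (7 appears twice).
df = pd.DataFrame({'a': ['x', 'y', 'z']}, index=[7, 7, 8])

# Strategy 1: the label of row 0 is 7, which matches two rows,
# so the result is a Series rather than a single value.
via_label = df.loc[df.index[0], 'a']

# Strategy 2: after resetting the index, labels equal positions,
# so .loc[0] picks exactly one row (a new frame is built first).
via_reset = df.reset_index(drop=True).loc[0, 'a']
```

With unique row labels both strategies return the same scalar; the difference only shows up when labels repeat.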

Better Solution

I suggest following @Landmaster's comment:

df.iloc[0, df.columns.get_loc("a")] 

Essentially, this is the same as df.iloc[0, 0] except we get the column index dynamically using df.columns.get_loc("a").
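A minimal worked example, assuming a frame with the same shape as the one in the question (here 'a' is the second column, so its position is 1 rather than 0):

```python
import pandas as pd

df = pd.DataFrame({'1': [2, 3, 4, 5],
                   'a': ['three', 'four', 'five', 'six']},
                  index=[2, 3, 4, 5])

col_pos = df.columns.get_loc('a')    # integer position of column 'a'
value = df.iloc[0, col_pos]          # purely positional lookup
```

Note that get_loc returns a plain integer only when the column labels are unique; with duplicated column names it returns a slice or a boolean mask instead.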

To index multiple columns such as ['a', 'b', 'c'], use:

df.iloc[0, [df.columns.get_loc(c) for c in ['a', 'b', 'c']]] 
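The question also asks whether this can be wrapped in something like df.idx. One possible approach, not proposed in the answers above, is pandas' accessor registration API (pd.api.extensions.register_dataframe_accessor). The sketch below is an assumption-laden illustration: the accessor name idx and the class name are hypothetical, it uses square brackets rather than a method call, and it assumes unique column labels.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("idx")  # "idx" is a hypothetical name
class PositionalRowAccessor:
    """Index rows by integer position and columns by name."""

    def __init__(self, pandas_obj):
        self._df = pandas_obj

    def __getitem__(self, key):
        rows, cols = key
        if pd.api.types.is_list_like(cols):
            # get_indexer maps many labels to positions; -1 marks a missing label.
            col_pos = self._df.columns.get_indexer(cols)
            if (col_pos == -1).any():
                raise KeyError(f"columns not found among: {list(cols)}")
        else:
            col_pos = self._df.columns.get_loc(cols)
        return self._df.iloc[rows, col_pos]


df = pd.DataFrame({'1': [2, 3, 4, 5],
                   'a': ['three', 'four', 'five', 'six']},
                  index=[2, 3, 4, 5])

single = df.idx[0, 'a']            # one cell
row = df.idx[0, ['a', '1']]        # one row, several named columns
column = df.idx[0:2, 'a']          # positional row slice, one named column
```

Using get_indexer for the list case avoids the explicit list comprehension, at the cost of having to check for -1 by hand.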

Update

This is discussed here as part of my course on Pandas.

Ben answered Sep 21 '22 13:09