Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most efficient way to exclude indexed rows in pandas dataframe

Tags:

python

pandas

I'm relatively new to Python & pandas and am struggling with (hierachical) indexes. I've got the basics covered, but am lost with more advanced slicing and cross-sectioning.

For example, with the following dataframe

import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(9).reshape((3, 3)),
    index=pd.Index(['Ohio', 'Colorado', 'New York'], name='state'), columns=pd.Index(['one', 'two', 'three'], name='number'))

I want to select everything except the row with index 'Colorado'. For a small dataset I could do:

data.ix[['Ohio','New York']]

But if the number of unique index values is large, that's impractical. Naively, I would expect a syntax like

data.ix[['state' != 'Colorado']]

However, this only returns the first record 'Ohio' and doesn't return 'New York'. This works, but is cumbersome

filter = list(set(data.index.get_level_values(0).unique()) - set(['Colorado']))
data[filter]

Surely there's a more Pythonic, verbose way of doing this?

like image 385
dkapitan Avatar asked Feb 08 '14 19:02

dkapitan


1 Answers

This is a Python issue, not a pandas one: 'state' != 'Colorado' is True, so what pandas gets is data.ix[[True]].

You could do

>>> data.loc[data.index != "Colorado"]
number    one  two  three
state                    
Ohio        0    1      2
New York    6    7      8

[2 rows x 3 columns]

or use DataFrame.query:

>>> data.query("state != 'New York'")
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5

[2 rows x 3 columns]

if you don't like the duplication of data. (Quoting the expression passed to the .query() method is one of the only ways around the fact that otherwise Python would evaluate the comparison before pandas ever saw it.)

like image 108
DSM Avatar answered Oct 02 '22 00:10

DSM