Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using boolean indexing for row and column MultiIndex in Pandas

Questions are at the end, in bold. But first, let's set up some data:

import numpy as np
import pandas as pd
from itertools import product

np.random.seed(1)

team_names = ['Yankees', 'Mets', 'Dodgers']
jersey_numbers = [35, 71, 84]
game_numbers = [1, 2]
observer_names = ['Bill', 'John', 'Ralph']
observation_types = ['Speed', 'Strength']

row_indices = list(product(team_names, jersey_numbers, game_numbers, observer_names, observation_types))
observation_values = np.random.randn(len(row_indices))

tns, jns, gns, ons, ots = zip(*row_indices)

data = pd.DataFrame({'team': tns, 'jersey': jns, 'game': gns, 'observer': ons, 'obstype': ots, 'value': observation_values})

data = data.set_index(['team', 'jersey', 'game', 'observer', 'obstype'])
data = data.unstack(['observer', 'obstype'])
data.columns = data.columns.droplevel(0)

this gives: data

I want to pluck out a subset of this DataFrame for subsequent analysis. Say I wanted to slice out the rows where the jersey number is 71. I don't really like the idea of using xs to do this. When you do a cross section via xs you lose the column you selected on. If I run:

data.xs(71, axis=0, level='jersey')

then I get back the right rows, but I lose the jersey column.

xs_slice

Also, xs doesn't seem like a great solution for the case where I want a few different values from the jersey column. I think a much nicer solution is the one found here:

data[[j in [71, 84] for t, j, g in data.index]]

boolean_slice_1

You could even filter on a combination of jerseys and teams:

data[[j in [71, 84] and t in ['Dodgers', 'Mets'] for t, j, g in data.index]]

boolean_slice_2

Nice!

So the question: how can I do something similar for selecting a subset of columns. For example, say I want only the columns representing data from Ralph. How can I do that without using xs? Or what if I wanted only the columns with observer in ['John', 'Ralph']? Again, I'd really prefer a solution that keeps all the levels of the row and column indices in the result...just like the boolean indexing examples above.

I can do what I want, and even combine selections from both the row and column indices. But the only solution I've found involves some real gymnastics:

data[[j in [71, 84] and t in ['Dodgers', 'Mets'] for t, j, g in data.index]]\
    .T[[obs in ['John', 'Ralph'] for obs, obstype in data.columns]].T

double_boolean_slice

And thus the second question: is there a more compact way to do what I just did above?

like image 749
8one6 Avatar asked Dec 24 '13 04:12

8one6


1 Answers

As of Pandas 0.18 (possibly earlier) you can easily slice multi-indexed DataFrames using pd.IndexSlice.

For your specific question, you can use the following to select by team, jersey, and game:

data.loc[pd.IndexSlice[:,[71, 84],:],:] #IndexSlice on the rows

IndexSlice needs just enough level information to be unambiguous so you can drop the trailing colon:

data.loc[pd.IndexSlice[:,[71, 84]],:]

Likewise, you can IndexSlice on columns:

data.loc[pd.IndexSlice[:,[71, 84]],pd.IndexSlice[['John', 'Ralph']]]

Which gives you the final DataFrame in your question.

like image 52
Cory Jog Avatar answered Sep 19 '22 03:09

Cory Jog