In Pandas, what is a good way to select sets of arbitrary rows in a multiindex?
import pandas as pd
df = pd.DataFrame(columns=['A', 'B', 'C'])
df['A'] = ['a', 'a', 'b', 'b']
df['B'] = [1,2,3,4]
df['C'] = [1,2,3,4]
the_indices_we_want = df.ix[[0,3],['A','B']]
df = df.set_index(['A', 'B']) #Create a multiindex
df.ix[the_indices_we_want] #ValueError: Cannot index with multidimensional key
df.ix[[tuple(x) for x in the_indices_we_want.values]]
This last line is an answer, but it feels clunky: the keys can't even be lists, they have to be tuples. It also involves generating a new object to do the indexing with. I'm in a situation where I'm trying to do a lookup on a multiindex dataframe, with indices from another dataframe:
data_we_want = dataframe_with_the_data.ix[dataframe_with_the_indices[['Index1','Index2']]]
Right now it looks like I need to write it like this:
data_we_want = dataframe_with_the_data.ix[[tuple(x) for x in dataframe_with_the_indices[['Index1','Index2']].values]]
That is workable, but if there are many rows (e.g. hundreds of millions of desired indices), generating this list of tuples becomes quite a burden. Any solutions?
Edit: The solution by @joris works, but not if the indices are all numbers. Example where the indices are all integers:
df = pd.DataFrame(columns=['A', 'B', 'C'])
df['A'] = ['a', 'a', 'b', 'b']
df['B'] = [1,2,3,4]
df['C'] = [1,2,3,4]
the_indices_we_want = df.ix[[0,3],['B','C']]
df = df.set_index(['B', 'C'])
df.ix[pd.Index(the_indices_we_want)] #ValueError: Cannot index with multidimensional key
df.ix[pd.Index(the_indices_we_want.astype('object'))] #Works, though feels clunky.
To drop multiple levels from a multi-level column index, call df.columns.droplevel() repeatedly, or pass it a list of levels.
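A minimal sketch of dropping column levels, assuming a small three-level column index made up for illustration (the level names `top`, `mid`, and `leaf` are hypothetical). It shows both the repeated call and the single call with a list of levels:

```python
import pandas as pd

# Hypothetical three-level column index, for illustration only
columns = pd.MultiIndex.from_tuples(
    [("x", "one", "a"), ("x", "two", "b")], names=["top", "mid", "leaf"]
)
df = pd.DataFrame([[1, 2], [3, 4]], columns=columns)

# Drop one level at a time
step = df.copy()
step.columns = step.columns.droplevel("top")
step.columns = step.columns.droplevel("mid")

# Or drop several levels in a single call by passing a list
once = df.copy()
once.columns = once.columns.droplevel(["top", "mid"])

print(list(step.columns))
print(list(once.columns))
```

Both variants should leave only the `leaf` level as the column labels.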
pandas MultiIndex to columns: use the DataFrame.reset_index() function to convert MultiIndex (multi-level index) levels to columns. The default setting is drop=False, which keeps the index values as columns and gives the DataFrame a new integer index starting from zero.
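As a small sketch of that behavior, reusing the A/B/C frame from the question (the names `flat` and `dropped` are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["a", "a", "b", "b"], "B": [1, 2, 3, 4], "C": [1, 2, 3, 4]}
).set_index(["A", "B"])

# drop=False (the default): levels A and B come back as ordinary columns
flat = df.reset_index()

# drop=True: the index levels are discarded instead of kept as columns
dropped = df.reset_index(drop=True)

print(flat)
print(dropped)
```

In both cases the result carries a fresh integer index starting at zero.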
The explanation: DataFrames always have an index, and there is no way to remove it, because it is a core part of every DataFrame. (df.iloc[0:4]['col name'] returns a Series, which has an index too.) You can only hide it in your output.
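One way to hide the index when printing is `DataFrame.to_string(index=False)`. The sketch below uses a small made-up frame; the index is still present on the object, it is only left out of the rendered text:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b"], "C": [1, 4]})

# Render the frame without the index column
text = df.to_string(index=False)
print(text)

# The index itself is untouched
print(df.index)
```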
You indeed cannot index with a DataFrame directly, but if you convert it to an Index object, it does the correct thing (a row in the DataFrame will be regarded as one multi-index entry):
In [43]: pd.Index(the_indices_we_want)
Out[43]: Index([(u'a', 1), (u'b', 4)], dtype='object')
In [44]: df.ix[pd.Index(the_indices_we_want)]
Out[44]:
     C
A B
a 1  1
b 4  4
In [45]: df.ix[[tuple(x) for x in the_indices_we_want.values]]
Out[45]:
     C
A B
a 1  1
b 4  4
This is somewhat cleaner, and in some quick tests it seems to be a bit faster (but not by much, only about 2 times).
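On current pandas releases, where `.ix` is no longer available, the same idea can be sketched with `pd.MultiIndex.from_frame` and `.loc`. This is a sketch under assumptions rather than the method shown above: it assumes a pandas version that provides `MultiIndex.from_frame` (0.24 or later), and the names `indexed`, `keys`, `int_indexed`, and `int_keys` are illustrative. It covers both the string/integer keys and the all-integer keys from the question's edit:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a", "b", "b"], "B": [1, 2, 3, 4], "C": [1, 2, 3, 4]})

# Mixed string/integer keys, as in the original question
the_indices_we_want = df.iloc[[0, 3]][["A", "B"]]
indexed = df.set_index(["A", "B"])
keys = pd.MultiIndex.from_frame(the_indices_we_want)  # each row becomes one key
result = indexed.loc[keys]

# All-integer keys, as in the question's edit
int_indices_we_want = df.iloc[[0, 3]][["B", "C"]]
int_indexed = df.set_index(["B", "C"])
int_keys = pd.MultiIndex.from_frame(int_indices_we_want)
int_result = int_indexed.loc[int_keys]

print(result)
print(int_result)
```

Because the keys are built as a MultiIndex rather than a flat Index, no cast to `object` should be needed for the integer case.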
In newer versions of pandas, .ix has been deprecated and then removed. Use .iloc for positional row selection (and .loc for label-based selection):
import pandas as pd
df = pd.DataFrame(columns=['A', 'B', 'C'])
df['A'] = ['a', 'a', 'b', 'b']
df['B'] = [1,2,3,4]
df['C'] = [1,2,3,4]
df.iloc[[0, 3]][['A', 'B']]
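For the lookup described in the question, where the keys live in the columns of a second dataframe, another option is to join on those columns instead of building a list of tuples. The sketch below reuses the question's names `dataframe_with_the_data`, `dataframe_with_the_indices`, `Index1`, and `Index2`; the `value` column and the sample data are made up for illustration, and performance at hundreds of millions of rows has not been tested here:

```python
import pandas as pd

# Hypothetical sample data; only the frame and key names come from the question
dataframe_with_the_data = pd.DataFrame(
    {"Index1": ["a", "a", "b", "b"], "Index2": [1, 2, 3, 4], "value": [10, 20, 30, 40]}
).set_index(["Index1", "Index2"])
dataframe_with_the_indices = pd.DataFrame({"Index1": ["a", "b"], "Index2": [1, 4]})

# Join the key columns against the MultiIndex of the data frame
data_we_want = dataframe_with_the_indices[["Index1", "Index2"]].join(
    dataframe_with_the_data, on=["Index1", "Index2"], how="inner"
)

# Restore the MultiIndex on the result
data_we_want = data_we_want.set_index(["Index1", "Index2"])
print(data_we_want)
```

With `how="inner"`, keys that are missing from the data are dropped silently rather than raising, which may or may not be what you want.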