Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Updating Pandas MultiIndex after indexing the dataframe

Suppose I have the following dataframe:

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
       ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.DataFrame(np.random.randn(8, 2), index=index, columns=[0, 1])
s

                     0         1
first second                    
bar   one    -0.012581  1.421286
      two    -0.048482 -0.153656
baz   one    -2.616540 -1.368694
      two    -1.989319  1.627848
foo   one    -0.404563 -1.099314
      two    -2.006166  0.867398
qux   one    -0.843150 -1.045291
      two     2.129620 -2.697217

I know select a sub-dataframe by indexing:

temp = s.loc[('bar', slice(None)), slice(None)].copy()
temp

                     0         1
first second                    
bar   one    -0.012581  1.421286
      two    -0.048482 -0.153656

However, if I look at the index, the values of the original index still appear:

temp.index
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
       labels=[[0, 0], [0, 1]],
       names=[u'first', u'second'])

This does not happen with normal dataframes. If you index, the remaining copy (or even the view) contains only the selected index/columns. This is annoying because I might often do lots of filtering on big dataframes and at the end I would like to know the index of what's left by just doing

df.index
df

This also happens for multiindex columns. Is there a proper way to update the index/columns and drop the empty entries?

To be clear, I want the filtered dataframe to have the same structure (multiindex index and columns). For example, I want to do:

 temp = s.loc[(('bar', 'foo'), slice(None)), :]

but the index still has 'baz' and 'qux' values:

MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
       labels=[[0, 0, 2, 2], [0, 1, 0, 1]],
       names=[u'first', u'second'])

To make clear the effect I would like to see, I wrote this snippet to eliminate redundant entries:

import pandas as pd
def update_multiindex(df):
    if isinstance(df.columns, pd.MultiIndex):
        new_df = {key: df.loc[:, key] for key in df.columns if not df.loc[:,     key].empty}    
        new_df = pd.DataFrame(new_df)
    else:
        new_df = df.copy()
    if isinstance(df.index, pd.MultiIndex):
        new_df = {key: new_df.loc[key, :] for key in new_df.index if not     new_df.loc[key, :].empty}
        new_df = pd.DataFrame(new_df).T
    return new_df

temp = update_multiindex(temp).index
temp
MultiIndex(levels=[[u'bar', u'foo'], [u'one', u'two']],
       labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
like image 986
user1350191 Avatar asked Nov 28 '25 11:11

user1350191


1 Answers

Two points. First, I think you may want to do something that is actually bad for you. I know it's annoying that you have a lot of extra cruft in your filtered indices, but if you rebuild the indices to exclude the missing categorical values, then your new indices will be incompatible with each other and the original index.

That said, I suspect (but do not know) that MultiIndex used this way is built on top of CategoricalIndex, which has the method remove_unused_levels(). It may be wrapped by MultiIndex, but I cannot tell, because...

Second, MultiIndex is notably missing from the pandas API documentation. I do not use MultiIndex, but you might consider looking for and/or opening a ticket on GitHub about this if you do use it regularly. Beyond that, you may have to grunnel through the source code if you want to find exact information on the features available with MultiIndex.

like image 199
Andreus Avatar answered Dec 01 '25 01:12

Andreus



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!