Suppose I have the following dataframe:
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.DataFrame(np.random.randn(8, 2), index=index, columns=[0, 1])
s
0 1
first second
bar one -0.012581 1.421286
two -0.048482 -0.153656
baz one -2.616540 -1.368694
two -1.989319 1.627848
foo one -0.404563 -1.099314
two -2.006166 0.867398
qux one -0.843150 -1.045291
two 2.129620 -2.697217
I know select a sub-dataframe by indexing:
temp = s.loc[('bar', slice(None)), slice(None)].copy()
temp
0 1
first second
bar one -0.012581 1.421286
two -0.048482 -0.153656
However, if I look at the index, the values of the original index still appear:
temp.index
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0], [0, 1]],
names=[u'first', u'second'])
This does not happen with normal dataframes. If you index, the remaining copy (or even the view) contains only the selected index/columns. This is annoying because I might often do lots of filtering on big dataframes and at the end I would like to know the index of what's left by just doing
df.index
df
This also happens for multiindex columns. Is there a proper way to update the index/columns and drop the empty entries?
To be clear, I want the filtered dataframe to have the same structure (multiindex index and columns). For example, I want to do:
temp = s.loc[(('bar', 'foo'), slice(None)), :]
but the index still has 'baz' and 'qux' values:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0, 2, 2], [0, 1, 0, 1]],
names=[u'first', u'second'])
To make clear the effect I would like to see, I wrote this snippet to eliminate redundant entries:
import pandas as pd
def update_multiindex(df):
if isinstance(df.columns, pd.MultiIndex):
new_df = {key: df.loc[:, key] for key in df.columns if not df.loc[:, key].empty}
new_df = pd.DataFrame(new_df)
else:
new_df = df.copy()
if isinstance(df.index, pd.MultiIndex):
new_df = {key: new_df.loc[key, :] for key in new_df.index if not new_df.loc[key, :].empty}
new_df = pd.DataFrame(new_df).T
return new_df
temp = update_multiindex(temp).index
temp
MultiIndex(levels=[[u'bar', u'foo'], [u'one', u'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
Two points. First, I think you may want to do something that is actually bad for you. I know it's annoying that you have a lot of extra cruft in your filtered indices, but if you rebuild the indices to exclude the missing categorical values, then your new indices will be incompatible with each other and the original index.
That said, I suspect (but do not know) that MultiIndex used this way is built on top of CategoricalIndex, which has the method remove_unused_levels(). It may be wrapped by MultiIndex, but I cannot tell, because...
Second, MultiIndex is notably missing from the pandas API documentation. I do not use MultiIndex, but you might consider looking for and/or opening a ticket on GitHub about this if you do use it regularly. Beyond that, you may have to grunnel through the source code if you want to find exact information on the features available with MultiIndex.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With