
Completely remove one index label from a multiindex, in a dataframe

Suppose I have this multi-indexed Series:

>>> import pandas as p 
>>> import numpy as np
... 
>>> arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo']),
...          np.array(['one', 'two', 'one', 'two', 'one', 'two'])]
... 
>>> s = p.Series(np.random.randn(6), index=arrays)
>>> s
bar  one   -1.046752
     two    2.035839
baz  one    1.192775
     two    1.774266
foo  one   -1.716643
     two    1.158605
dtype: float64

How can I eliminate the index label bar?
I tried drop:

>>> s1 = s.drop('bar')
>>> s1
baz  one    1.192775
     two    1.774266
foo  one   -1.716643
     two    1.158605
dtype: float64

That seems OK, but bar is still present in the index in some odd way:

>>> s1.index
MultiIndex(levels=[[u'bar', u'baz', u'foo'], [u'one', u'two']],
           labels=[[1, 1, 2, 2], [0, 1, 0, 1]])
>>> s1['bar']
Series([], dtype: float64)
>>> 

How can I get rid of any residue of this index label?

joaquin asked Jun 17 '15 17:06


1 Answer

Definitely looks like a bug.

s1.index.tolist() returns the expected values, without "bar".

>>> s1.index.tolist()
[('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')]

s1["bar"] returns an empty Series.

>>> s1["bar"]
Series([], dtype: float64)

The standard methods to override this don't seem to work either:

>>> del s1["bar"] 
>>> s1["bar"]
Series([], dtype: float64)
>>> s1.__delitem__("bar")
>>> s1["bar"]
Series([], dtype: float64)

However, as expected, trying to access a key that never existed raises a KeyError:

>>> s1["booz"]
... KeyError: 'booz'

The difference becomes clear when you look at how the two are implemented in pandas.core.index.py:

class MultiIndex(Index):
    ...

    def _get_levels(self):
        return self._levels

    ...

    def _get_labels(self):
        return self._labels

    # ops compat
    def tolist(self):
        """
        return a list of the Index values
        """
        return list(self.values)

So index.tolist() and _labels do not read from the same piece of shared information; in fact, they are not even close.
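To make that relationship concrete, here is a minimal sketch, assuming a recent pandas in which the attribute shown as labels in this answer is exposed as codes (it was renamed in pandas 0.24). It rebuilds the index values by hand from levels and codes. The series values are arbitrary and the name rebuilt is only illustrative.

```python
import numpy as np
import pandas as pd

arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two'])]
s = pd.Series(np.arange(6.0), index=arrays)
s1 = s.drop('bar', level=0)

idx = s1.index
# Each row of the index is stored as one integer code per level;
# the code is a position into that level's list of unique labels.
rebuilt = [
    tuple(level[code] for level, code in zip(idx.levels, row_codes))
    for row_codes in zip(*idx.codes)
]

# The visible values are derived from levels + codes, so a label can sit
# in `levels` without appearing in any row.
print(rebuilt)
print(idx.tolist())
```

Both printed lists should match, which is why tolist() looks clean even when a level still carries a label that no row refers to.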

We can use this to update the resulting index manually. After the drop, the index holds:

>>> s1.index.labels
FrozenList([[1, 1, 2, 2], [0, 1, 0, 1]])
>>> s1.index._levels
FrozenList([[u'bar', u'baz', u'foo'], [u'one', u'two']])
>>> s1.index.values
array([('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')], dtype=object)

If we compare this to the initial multi-indexed index, we get

>>> s.index.labels
FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
>>> s.index._levels
FrozenList([[u'bar', u'baz', u'foo'], [u'one', u'two']])

So the _levels attribute is not updated, while values is.
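Since values (and therefore tolist()) is up to date, one possible workaround is to rebuild the index from those tuples. This is only a sketch, not what the answer's custom function does; the names s2 and new_index are illustrative, and it assumes the dropped series is not empty, because from_tuples cannot infer the number of levels from an empty list.

```python
import numpy as np
import pandas as pd

arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two'])]
s = pd.Series(np.arange(6.0), index=arrays)
s1 = s.drop('bar', level=0)

# Rebuild the MultiIndex from the tuples that are actually present,
# so the levels are recomputed from scratch.
new_index = pd.MultiIndex.from_tuples(s1.index.tolist(), names=s1.index.names)

s2 = s1.copy()
s2.index = new_index
print(s2.index.levels[0])
```

The trade-off is that the whole index is reconstructed, which may be noticeable on very large indexes.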

EDIT: Overriding it wasn't as easy as I thought.

EDIT: Wrote a custom function to fix this behavior

from pandas.core.base import FrozenList, FrozenNDArray

def drop(series, level, index_name):
    # make new tmp series
    new_series = series.drop(index_name)
    # grab all indexing labels, levels, attributes
    levels = new_series.index.levels
    labels = new_series.index.labels
    index_pos = levels[level].tolist().index(index_name)
    # now need to reset the actual levels
    level_names = levels[level]
    # has no __delitem__, so... need to remake
    tmp_names = FrozenList([i for i in level_names if i != index_name])
    levels = FrozenList([j if i != level else tmp_names
                         for i, j in enumerate(levels)])
    # need to turn off validation
    new_series.index.set_levels(levels, verify_integrity=False, inplace=True)
    # reset the labels
    level_labels = labels[level].tolist()
    tmp_labels = FrozenNDArray([i-1 if i > index_pos else i
                                for i in level_labels])
    labels = FrozenList([j if i != level else tmp_labels
                         for i, j in enumerate(labels)])
    new_series.index.set_labels(labels, verify_integrity=False, inplace=True)
    return new_series

Example use:

>>> s1 = drop(s, 0, "bar")
>>> s1.index
MultiIndex(levels=[[u'baz', u'foo'], [u'one', u'two']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
>>> s1.index.tolist()
[('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')]
>>> s1["bar"]
...
KeyError: 'bar'
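On newer pandas versions the same result can probably be had with less code. The sketch below assumes pandas 0.20 or later, where MultiIndex.remove_unused_levels() is available; the helper name drop_label is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd


def drop_label(series, label, level=0):
    """Drop `label` from `level` and prune it from the index levels too."""
    dropped = series.drop(label, level=level)
    # remove_unused_levels() returns a new MultiIndex whose levels contain
    # only labels that are still referenced by at least one row.
    dropped.index = dropped.index.remove_unused_levels()
    return dropped


arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two'])]
s = pd.Series(np.arange(6.0), index=arrays)

cleaned = drop_label(s, 'bar')
print(cleaned.index.levels)
```

With the level pruned, looking up cleaned['bar'] should raise a KeyError, as in the output of the custom function above.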

EDIT: This seems to be specific to dataframes/series with multiindexing, as the standard pandas.core.index.Index class does not have the same limitations. I would recommend filing a bug report.

Consider the same series with a standard index:

>>> s = p.Series(np.random.randn(6))
>>> s.index
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
>>> s.drop(0, inplace=True)
>>> s.index
Int64Index([1, 2, 3, 4, 5], dtype='int64')

The same is true for a DataFrame:

>>> df = p.DataFrame([np.random.randn(6), np.random.randn(6)])
>>> df.index
Int64Index([0, 1], dtype='int64')
>>> df.drop(0, inplace=True)
>>> df.index
Int64Index([1], dtype='int64')

Alexander Huszagh answered Sep 21 '22 01:09