
Multiindex pandas groupby + aggregate, keep full index

Tags: python, pandas

I have a two-level hierarchically-indexed sequence of integers.

 >> s
 id1    id2    
 1      a     100
        b      10
        c       9 
 2      a    2000
 3      a       5
        b      10
        c      15
        d      20
 ...

I want to group by id1, and select the maximum value, but have the full index in the result. I have tried the following:

 >> s.groupby(level=0).aggregate(np.max)
 id1              
 1    100 
 2   2000
 3     20

But the result is indexed by id1 only. I want my output to look like this:

 id1    id2    
 1      a     100
 2      a    2000
 3      d      20

A related, but more complicated, question was asked here: "Multiindexed Pandas groupby, ignore a level?". As it states, the answer there is kind of a hack.

Does anyone know a better solution? If not, what about the special case where every value of id2 is unique?

asked Feb 25 '15 10:02 by hanna


1 Answer

One way to select full rows after a groupby is to use groupby/transform to build a boolean mask and then use the mask to select the full rows from s:

In [110]: s[s.groupby(level=0).transform(lambda x: x == x.max()).astype(bool)]
Out[110]: 
id1  id2
1    a       100
2    a      2000
3    d        20
Name: s, dtype: int64
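For readers who want to run this end to end, here is a minimal sketch that rebuilds the Series from the question and applies the same mask idea. It swaps the lambda for `transform('max')`, which is a variant and not the exact expression from the answer; the variable names `mask` and `result` are only illustrative.

```python
import pandas as pd

# Rebuild the Series from the question (values copied from the post).
index = pd.MultiIndex.from_tuples(
    [(1, 'a'), (1, 'b'), (1, 'c'),
     (2, 'a'),
     (3, 'a'), (3, 'b'), (3, 'c'), (3, 'd')],
    names=['id1', 'id2'])
s = pd.Series([100, 10, 9, 2000, 5, 10, 15, 20], index=index, name='s')

# transform('max') returns a Series with the same index as s that holds,
# for every row, the maximum of that row's id1 group. Comparing it with s
# therefore yields a row-aligned boolean mask.
mask = s == s.groupby(level=0).transform('max')

# Boolean indexing keeps the full (id1, id2) index of the selected rows.
result = s[mask]
print(result)
```

Because rows are selected by equality with the group maximum, a group whose maximum occurs more than once contributes all of its tied rows.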

Another way, which is faster in some cases -- such as when there are a lot of groups -- is to merge the max values m into a DataFrame along with the values in s, and then select rows based on equality between m and s:

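def using_merge(s):
    # Per-group maxima, indexed by id1.
    m = s.groupby(level=0).max()
    # Move id2 into a column so the frame is indexed by id1 only.
    df = s.reset_index(['id2'])
    # Assignment aligns m on the id1 index, giving each row its group's max.
    df['m'] = m
    # Keep rows equal to their group max (this relies on s.name == 's').
    result = df.loc[df['s'] == df['m']].drop(columns='m')
    result = result.set_index(['id2'], append=True)
    return result['s']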
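The step that does the "merge" is the column assignment, which relies on pandas index alignment. The sketch below isolates that step on a tiny, made-up frame (the data is hypothetical and chosen only to show the behavior):

```python
import pandas as pd

# Toy frame indexed by id1 only, with a repeated label (hypothetical data).
df = pd.DataFrame({'id2': ['a', 'b', 'a'], 's': [100, 10, 2000]},
                  index=pd.Index([1, 1, 2], name='id1'))

# One maximum per id1 value; the index of m is unique.
m = df['s'].groupby(level=0).max()

# Assigning a Series as a column aligns it to df's index, so every row
# receives the maximum of its own id1 group.
df['m'] = m
print(df)
```

After the assignment, comparing `df['s']` with `df['m']` row by row identifies the rows that hold their group's maximum.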

Here is an example showing that using_merge, while more complicated, may be faster than using_transform:

import numpy as np
import pandas as pd
def using_transform(s):
    return s[s.groupby(level=0).transform(lambda x: x == x.max()).astype(bool)]

N = 10**5
id1 = np.random.randint(100, size=N)
id2 = np.random.choice(list('abcd'), size=N)
index = pd.MultiIndex.from_arrays([id1, id2])
ss = pd.Series(np.random.randint(100, size=N), index=index)
ss.index.names = ['id1', 'id2']
ss.name = 's'

Timing these two functions using IPython's %timeit magic command yields:

In [121]: %timeit using_merge(ss)
100 loops, best of 3: 12.8 ms per loop

In [122]: %timeit using_transform(ss)
10 loops, best of 3: 45 ms per loop
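Before trusting a timing comparison, it helps to confirm that both functions select the same rows. The following is a rough sketch of such a check, with the two functions repeated so the block runs on its own. The fixed seed, the use of the standard-library `timeit` module in place of `%timeit`, and the choice of 10 repetitions are assumptions made for this example; absolute timings will vary with the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def using_transform(s):
    return s[s.groupby(level=0).transform(lambda x: x == x.max()).astype(bool)]


def using_merge(s):
    m = s.groupby(level=0).max()
    df = s.reset_index(['id2'])
    df['m'] = m
    result = df.loc[df['s'] == df['m']].drop(columns='m')
    return result.set_index(['id2'], append=True)['s']


# Same shape of test data as in the answer; the seed is an addition here
# so that repeated runs use identical data.
rng = np.random.default_rng(0)
N = 10**5
index = pd.MultiIndex.from_arrays(
    [rng.integers(100, size=N), rng.choice(list('abcd'), size=N)],
    names=['id1', 'id2'])
ss = pd.Series(rng.integers(100, size=N), index=index, name='s')

# Both approaches should select exactly the same rows, in the same order.
assert using_merge(ss).equals(using_transform(ss))

# Average wall-clock time per call over a few repetitions.
for func in (using_merge, using_transform):
    seconds = timeit.timeit(lambda: func(ss), number=10) / 10
    print(f'{func.__name__}: {seconds * 1e3:.1f} ms per call')
```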
answered Sep 30 '22 16:09 by unutbu