
How to control index returned by pandas groupby with rolling summary function

I have data with a MultiIndex, like this:

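import itertools
import numpy as np
import pandas as pd

idx1 = list('XYZ')
idx2 = range(3)
# Name the levels so the index matches the 'first'/'second' headers shown below
idx = pd.MultiIndex.from_tuples(list(itertools.product(idx1, idx2)), names=['first', 'second'])
df = pd.DataFrame(np.random.rand(9, 4), columns=list('ABCD'), index=idx)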

                     A         B         C         D
first second                                        
X     0       0.808432  0.708881  0.411515  0.704168
      1       0.322688  0.093869  0.651238  0.146480
      2       0.800746  0.156890  0.131700  0.220423
Y     0       0.102290  0.129895  0.939147  0.510555
      1       0.462014  0.749873  0.585867  0.357788
      2       0.794327  0.141203  0.414841  0.923480
Z     0       0.557513  0.768428  0.487475  0.824503
      1       0.258303  0.115791  0.102588  0.062753
      2       0.934960  0.700371  0.319663  0.642070

Here is the result for summing by group over the first index level:

In[]: df.groupby(level=0).sum()
Out[]: 
              A         B         C         D
first                                        
X      1.931866  0.959640  1.194453  1.071071
Y      1.358631  1.020971  1.939855  1.791824
Z      1.750776  1.584590  0.909725  1.529326

Seems reasonable -- I summed over the first level of the index, so the 2nd level is gone. But if instead I use the rolling method:

df.groupby(level=0).rolling(2).sum()

I get

                           A         B         C         D
first first second                                        
X     X     0            NaN       NaN       NaN       NaN
            1       1.131120  0.802750  1.062753  0.850648
            2       1.123434  0.250759  0.782938  0.366903
Y     Y     0            NaN       NaN       NaN       NaN
            1       0.564303  0.879768  1.525014  0.868343
            2       1.256341  0.891075  1.000708  1.281269
Z     Z     0            NaN       NaN       NaN       NaN
            1       0.815816  0.884219  0.590062  0.887256
            2       1.193263  0.816162  0.422251  0.704823

where for some reason pandas has decided to return a 3-level index, repeating the first level. Why is this happening? Is there a better way to write my code so it doesn't do this?

Also, since the first level name is repeated, calling reset_index() on the result raises "ValueError: cannot insert first, already exists", so I can't see how to drop the repeated index level. Any tips?

Asked by itzy, Jun 15 '16 19:06


1 Answer

Use group_keys=False:

In [43]: df.groupby(level=0, group_keys=False).rolling(2).sum()
Out[43]: 
            A         B         C         D
X 0       NaN       NaN       NaN       NaN
  1  1.244257  1.430957  0.798310  0.779261
  2  0.632238  1.512251  1.473498  0.395945
Y 0       NaN       NaN       NaN       NaN
  1  1.241747  0.865178  0.550665  1.070216
  2  1.629892  1.328947  1.046749  1.167371
Z 0       NaN       NaN       NaN       NaN
  1  0.406606  0.945525  0.936090  1.301093
  2  0.701282  0.975851  0.586523  0.698980

In contrast to:

In [44]: df.groupby(level=0, group_keys=True).rolling(2).sum()
Out[44]: 
              A         B         C         D
X X 0       NaN       NaN       NaN       NaN
    1  1.244257  1.430957  0.798310  0.779261
    2  0.632238  1.512251  1.473498  0.395945
Y Y 0       NaN       NaN       NaN       NaN
    1  1.241747  0.865178  0.550665  1.070216
    2  1.629892  1.328947  1.046749  1.167371
Z Z 0       NaN       NaN       NaN       NaN
    1  0.406606  0.945525  0.936090  1.301093
    2  0.701282  0.975851  0.586523  0.698980

By the way, if you do find yourself stuck with a MultiIndex level that you wish to drop, you can use the MultiIndex.droplevel method:

result = df.groupby(level=0, group_keys=True).rolling(2).sum()
result.index = result.index.droplevel(level=0)
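
As a self-contained check of the droplevel approach, here is a minimal sketch. It rebuilds a frame shaped like the one in the question (with random values, so the numbers will differ from the outputs shown here) and assumes the rolling result carries the group key as one extra leading index level, as in the question's output. The loop that drops only the surplus levels is my own defensive addition, not part of the answer above; it does nothing if no extra level was added.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the question (values are random).
idx = pd.MultiIndex.from_product(
    [list("XYZ"), range(3)], names=["first", "second"]
)
df = pd.DataFrame(np.random.rand(9, 4), columns=list("ABCD"), index=idx)

result = df.groupby(level=0).rolling(2).sum()

# Drop leading levels until the result has as many index levels as the input.
# Assumption: any surplus levels are the prepended group keys.
extra_levels = result.index.nlevels - df.index.nlevels
for _ in range(extra_levels):
    result.index = result.index.droplevel(0)

# With the duplicate 'first' level gone, reset_index() no longer collides.
flat = result.reset_index()
print(flat.head())
```

Once the repeated level is removed, the index names are unique again, which is why reset_index() stops raising the ValueError mentioned in the question.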
Answered by unutbu, Sep 28 '22 02:09