Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Index levels doubled when using groupby/apply on a multiindexed dataframe

Tags:

python

pandas

I have a problem when using a groupby/apply chain on a MultiIndex DataFrame: The resulting data frame contains the grouped level(s) twice!

Example data frame:

df = pandas.DataFrame(dict(a=[1, 2, 3, 4, 5, 6], b=[6, 7, 8, 9, 0, 0]),
                      index=pandas.MultiIndex.from_product([[1, 2], [3, 4, 5]]))

     a  b
1 3  1  6
  4  2  7
  5  3  8
2 3  4  9
  4  5  0
  5  6  0

In this example, I simply sum over the rows (I actually have another function, but one that also accepts and returns a dataframe):

df.groupby(level=0).apply(lambda x: x.sum(axis=1))

And it gives me the following result:

1  1  3     7
      4     9
      5    11
2  2  3    13
      4     5
      5     6
dtype: int64

So now I have three index levels, with the one grouped by doubled. When I group by both levels, the whole multiindex in doubled:

df.groupby(level=[0,1]).apply(lambda x: x.sum(axis=1))

1  3  1  3     7
   4  1  4     9
   5  1  5    11
2  3  2  3    13
   4  2  4     5
   5  2  5     6
dtype: int64

If I set as_index=False, I still get another index level, containing ascending numbers:

df.groupby(level=[0,1], as_index=False).apply(lambda x: x.sum(axis=1))

0  1  3     7
1  1  4     9
2  1  5    11
3  2  3    13
4  2  4     5
5  2  5     6
dtype: int64

Is this intended behavior? How can I avoid that another index level is created? Do I have to remove it by hand every time I do a groupby/apply operation?

like image 390
proggy Avatar asked Nov 30 '15 17:11

proggy


1 Answers

There's also the group_keys option which leaves only the MultiIndex of the Series returned by the .sum(axis=1) applied to the individual chunks:

df = pandas.DataFrame(dict(a=[1, 2, 3, 4, 5, 6], b=[6, 7, 8, 9, 0, 0]),
                  index=pandas.MultiIndex.from_product([[1, 2], [3, 4, 5]]))

print(df.groupby(level=0, group_keys=False).apply(lambda x: x.sum(axis=1)))

1  3     7
   4     9
   5    11
2  3    13
   4     5
   5     6
dtype: int64
like image 146
Stefan Avatar answered Oct 06 '22 00:10

Stefan