Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: How to group by and sum MultiIndex

I have a dataframe with categorical attributes where the index contains duplicates. I am trying to find the sum of each possible combination of index and attribute.

x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack()
print(y)
print(y.groupby(level=[0,1]).sum())

output

11  x    1
    y    3
    x    1
    y    3
12  x    3
    y    5
    x    3
    y    5
dtype: int64
11  x    1
    y    3
    x    1
    y    3
12  x    3
    y    5
    x    3
    y    5
dtype: int64

The stack and group by sum are just the same.

However, the one I expect is

11  x    2
11  y    6
12  x    6
12  y    10

EDIT 2:

x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack().groupby(level=[0,1]).sum()
print(y.groupby(level=[0,1]).sum())

output:

11  x    1
    y    3
    x    1
    y    3
12  x    3
    y    5
    x    3
    y    5
dtype: int64

EDIT3: An issue has been logged https://github.com/pydata/pandas/issues/10417

like image 502
vumaasha Avatar asked Jun 23 '15 18:06

vumaasha


People also ask

How do you group by and count in pandas?

Use count() by Column NameUse pandas DataFrame. groupby() to group the rows by column and use count() method to get the count for each group by ignoring None and Nan values. It works with non-floating type data as well.

Can you group by two things in pandas?

Grouping by multiple columns with multiple aggregations functions. Can you groupby your data set multiple columns in Pandas? You bet! Here's an example of multiple aggregations per grouping, each with their specific calculated function: a sum of the aggregating column and an average calculation.

How do I group values in a column in pandas?

Group by and value_counts Groupby is a very powerful pandas method. You can group by one column and count the values of another column per this column value using value_counts. Using groupby and value_counts we can count the number of activities each person did.


2 Answers

With pandas 0.16.2 and Python 3, I was able to get the correct result via:

x.stack().reset_index().groupby(['level_0','level_1']).sum()

Which produces:

                    0
level_0 level_1 
     11       x     2
              y     6
     12       x     6
              y     10

You can then change the index and column names to more desirable ones using reindex() and columns.

Based on my research, I agree that the failure of the original approach appears to be a bug. I think the bug is on Series, which is what x.stack() produces. My workaround is to turn the Series into a DataFrame via reset_index(). In this case the DataFrame does not have a MultiIndex anymore - I'm just grouping on labeled columns.

To make sure that grouping and summing works on a DataFrame with a MultiIndex, you can try this to get the same correct output:

x.stack().reset_index().set_index(['level_0','level_1'],drop=True).\
groupby(level=[0,1]).sum()

Either of these workarounds should take care of things until the bug is resolved.

I wonder if the bug has something to do with the MultiIndex instances that are created on a Series vs. a DataFrame. For example:

In[1]: obj = x.stack()
       type(obj)
Out[1]: pandas.core.series.Series

In[2]: obj.index
Out[2]: MultiIndex(levels=[[11, 11, 12, 12], ['x', 'y']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]])

vs.

In[3]: obj = x.stack().reset_index().set_index(['level_0','level_1'],drop=True)
       type(obj)
Out[3]: pandas.core.frame.DataFrame

In[4]: obj.index
Out[4]: MultiIndex(levels=[[11, 12], ['x', 'y']],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['level_0', 'level_1'])

Notice how the MultiIndex on the DataFrame describes the levels more correctly.

like image 53
sparc_spread Avatar answered Sep 29 '22 21:09

sparc_spread


sum allows you to specify the levels to sum over in a MultiIndex data frame.

x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack()

y.sum(level=[0,1])

11  x     2
    y     6
12  x     6
    y    10
like image 31
Tai Avatar answered Sep 29 '22 21:09

Tai