I have a dataframe with categorical attributes where the index contains duplicates. I am trying to find the sum of each possible combination of index and attribute. <pre class="prettyprint"><code>x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12]) y = x.stack() print(y) print(y.groupby(level=[0,1]).sum()) </code></pre> output <pre class="prettyprint"><code>11 x 1 y 3 x 1 y 3 12 x 3 y 5 x 3 y 5 dtype: int64 11 x 1 y 3 x 1 y 3 12 x 3 y 5 x 3 y 5 dtype: int64 </code></pre> The stack and group by sum are just the same. However, the one I expect is <pre class="prettyprint"><code>11 x 2 11 y 6 12 x 6 12 y 10 </code></pre> EDIT 2: <pre class="prettyprint"><code>x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12]) y = x.stack().groupby(level=[0,1]).sum() print(y.groupby(level=[0,1]).sum()) </code></pre> output: <pre class="prettyprint"><code>11 x 1 y 3 x 1 y 3 12 x 3 y 5 x 3 y 5 dtype: int64 </code></pre> EDIT3: An issue has been logged https://github.com/pydata/pandas/issues/10417

With pandas 0.16.2 and Python 3, I was able to get the correct result via: <pre class="prettyprint"><code>x.stack().reset_index().groupby(['level_0','level_1']).sum() </code></pre> Which produces: <pre class="prettyprint"><code> 0 level_0 level_1 11 x 2 y 6 12 x 6 y 10 </code></pre> You can then change the index and column names to more desirable ones using <code>reindex()</code> and <code>columns</code>. Based on my research, I agree that the failure of the original approach appears to be a bug. I think the bug is on <code>Series</code>, which is what <code>x.stack()</code> produces. My workaround is to turn the <code>Series</code> into a <code>DataFrame</code> via <code>reset_index()</code>. In this case the <code>DataFrame</code> does not have a <code>MultiIndex</code> anymore - I'm just grouping on labeled columns. To make sure that grouping and summing works on a <code>DataFrame</code> with a <code>MultiIndex</code>, you can try this to get the same correct output: <pre class="prettyprint"><code>x.stack().reset_index().set_index(['level_0','level_1'],drop=True).\ groupby(level=[0,1]).sum() </code></pre> Either of these workarounds should take care of things until the bug is resolved. I wonder if the bug has something to do with the <code>MultiIndex</code> instances that are created on a <code>Series</code> vs. a <code>DataFrame</code>. For example: <pre class="prettyprint"><code>In[1]: obj = x.stack() type(obj) Out[1]: pandas.core.series.Series In[2]: obj.index Out[2]: MultiIndex(levels=[[11, 11, 12, 12], ['x', 'y']], labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]]) </code></pre> vs. <pre class="prettyprint"><code>In[3]: obj = x.stack().reset_index().set_index(['level_0','level_1'],drop=True) type(obj) Out[3]: pandas.core.frame.DataFrame In[4]: obj.index Out[4]: MultiIndex(levels=[[11, 12], ['x', 'y']], labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]], names=['level_0', 'level_1']) </code></pre> Notice how the <code>MultiIndex</code> on the <code>DataFrame</code> describes the levels more correctly.

<code>sum</code> allows you to specify the levels to sum over in a MultiIndex data frame. <pre class="prettyprint"><code>x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12]) y = x.stack() y.sum(level=[0,1]) 11 x 2 y 6 12 x 6 y 10 </code></pre>

Pandas: How to group by and sum MultiIndex

I have a dataframe with categorical attributes where the index contains duplicates. I am trying to find the sum of each possible combination of index and attribute.

x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack()
print(y)
print(y.groupby(level=[0,1]).sum())

output

11  x    1
    y    3
    x    1
    y    3
12  x    3
    y    5
    x    3
    y    5
dtype: int64
11  x    1
    y    3
    x    1
    y    3
12  x    3
    y    5
    x    3
    y    5
dtype: int64

The stack and group by sum are just the same.

However, the one I expect is

11  x    2
11  y    6
12  x    6
12  y    10

EDIT 2:

x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack().groupby(level=[0,1]).sum()
print(y.groupby(level=[0,1]).sum())

output:

11  x    1
    y    3
    x    1
    y    3
12  x    3
    y    5
    x    3
    y    5
dtype: int64

EDIT3: An issue has been logged https://github.com/pydata/pandas/issues/10417

How do you group by and count in pandas?

Use count() by Column NameUse pandas DataFrame. groupby() to group the rows by column and use count() method to get the count for each group by ignoring None and Nan values. It works with non-floating type data as well.

Can you group by two things in pandas?

Grouping by multiple columns with multiple aggregations functions. Can you groupby your data set multiple columns in Pandas? You bet! Here's an example of multiple aggregations per grouping, each with their specific calculated function: a sum of the aggregating column and an average calculation.

How do I group values in a column in pandas?

Group by and value_counts Groupby is a very powerful pandas method. You can group by one column and count the values of another column per this column value using value_counts. Using groupby and value_counts we can count the number of activities each person did.

With pandas 0.16.2 and Python 3, I was able to get the correct result via:

x.stack().reset_index().groupby(['level_0','level_1']).sum()

Which produces:

                    0
level_0 level_1 
     11       x     2
              y     6
     12       x     6
              y     10

You can then change the index and column names to more desirable ones using reindex() and columns.

Based on my research, I agree that the failure of the original approach appears to be a bug. I think the bug is on Series, which is what x.stack() produces. My workaround is to turn the Series into a DataFrame via reset_index(). In this case the DataFrame does not have a MultiIndex anymore - I'm just grouping on labeled columns.

To make sure that grouping and summing works on a DataFrame with a MultiIndex, you can try this to get the same correct output:

x.stack().reset_index().set_index(['level_0','level_1'],drop=True).\
groupby(level=[0,1]).sum()

Either of these workarounds should take care of things until the bug is resolved.

I wonder if the bug has something to do with the MultiIndex instances that are created on a Series vs. a DataFrame. For example:

In[1]: obj = x.stack()
       type(obj)
Out[1]: pandas.core.series.Series

In[2]: obj.index
Out[2]: MultiIndex(levels=[[11, 11, 12, 12], ['x', 'y']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]])

vs.

In[3]: obj = x.stack().reset_index().set_index(['level_0','level_1'],drop=True)
       type(obj)
Out[3]: pandas.core.frame.DataFrame

In[4]: obj.index
Out[4]: MultiIndex(levels=[[11, 12], ['x', 'y']],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['level_0', 'level_1'])

Notice how the MultiIndex on the DataFrame describes the levels more correctly.

sum allows you to specify the levels to sum over in a MultiIndex data frame.

x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack()

y.sum(level=[0,1])

11  x     2
    y     6
12  x     6
    y    10

Pandas: How to group by and sum MultiIndex

Tags:

python

pandas

multi-index

vumaasha

People also ask

2 Answers

sparc_spread

Tai

Recent Activity

Donate For Us

Pandas: How to group by and sum MultiIndex

Tags:

python

pandas

multi-index

vumaasha

People also ask

2 Answers

sparc_spread

Tai

Related questions

Recent Activity

Donate For Us