I have a dataframe with categorical attributes where the index contains duplicates. I am trying to find the sum of each possible combination of index and attribute.
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack()
print(y)
print(y.groupby(level=[0,1]).sum())
output
11 x 1
y 3
x 1
y 3
12 x 3
y 5
x 3
y 5
dtype: int64
11 x 1
y 3
x 1
y 3
12 x 3
y 5
x 3
y 5
dtype: int64
The stack and group by sum are just the same.
However, the one I expect is
11 x 2
11 y 6
12 x 6
12 y 10
EDIT 2:
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack().groupby(level=[0,1]).sum()
print(y.groupby(level=[0,1]).sum())
output:
11 x 1
y 3
x 1
y 3
12 x 3
y 5
x 3
y 5
dtype: int64
EDIT3: An issue has been logged https://github.com/pydata/pandas/issues/10417
Use count() by Column NameUse pandas DataFrame. groupby() to group the rows by column and use count() method to get the count for each group by ignoring None and Nan values. It works with non-floating type data as well.
Grouping by multiple columns with multiple aggregations functions. Can you groupby your data set multiple columns in Pandas? You bet! Here's an example of multiple aggregations per grouping, each with their specific calculated function: a sum of the aggregating column and an average calculation.
Group by and value_counts Groupby is a very powerful pandas method. You can group by one column and count the values of another column per this column value using value_counts. Using groupby and value_counts we can count the number of activities each person did.
With pandas 0.16.2 and Python 3, I was able to get the correct result via:
x.stack().reset_index().groupby(['level_0','level_1']).sum()
Which produces:
0
level_0 level_1
11 x 2
y 6
12 x 6
y 10
You can then change the index and column names to more desirable ones using reindex()
and columns
.
Based on my research, I agree that the failure of the original approach appears to be a bug. I think the bug is on Series
, which is what x.stack()
produces. My workaround is to turn the Series
into a DataFrame
via reset_index()
. In this case the DataFrame
does not have a MultiIndex
anymore - I'm just grouping on labeled columns.
To make sure that grouping and summing works on a DataFrame
with a MultiIndex
, you can try this to get the same correct output:
x.stack().reset_index().set_index(['level_0','level_1'],drop=True).\
groupby(level=[0,1]).sum()
Either of these workarounds should take care of things until the bug is resolved.
I wonder if the bug has something to do with the MultiIndex
instances that are created on a Series
vs. a DataFrame
. For example:
In[1]: obj = x.stack()
type(obj)
Out[1]: pandas.core.series.Series
In[2]: obj.index
Out[2]: MultiIndex(levels=[[11, 11, 12, 12], ['x', 'y']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]])
vs.
In[3]: obj = x.stack().reset_index().set_index(['level_0','level_1'],drop=True)
type(obj)
Out[3]: pandas.core.frame.DataFrame
In[4]: obj.index
Out[4]: MultiIndex(levels=[[11, 12], ['x', 'y']],
labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]],
names=['level_0', 'level_1'])
Notice how the MultiIndex
on the DataFrame
describes the levels more correctly.
sum
allows you to specify the levels to sum over in a MultiIndex data frame.
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack()
y.sum(level=[0,1])
11 x 2
y 6
12 x 6
y 10
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With