To obtain results executed on groupby-data with the same level of detail as the original DataFrame (same observation count) I have used the transform function.
Example: Original dataframe
name, year, grade
Jack, 2010, 6
Jack, 2011, 7
Rosie, 2010, 7
Rosie, 2011, 8
After groupby transform
name, year, grade, average grade
Jack, 2010, 6, 6.5
Jack, 2011, 7, 6.5
Rosie, 2010, 7, 7.5
Rosie, 2011, 8, 7.5
However, with more advanced functions based on multiple columns things get more complicated. What puzzles me is that I seem to be unable to access multiple columns in a groupby-transform combination.
df = pd.DataFrame({'a':[1,2,3,4,5,6],
'b':[1,2,3,4,5,6],
'c':['q', 'q', 'q', 'q', 'w', 'w'],
'd':['z','z','z','o','o','o']})
def f(x):
y=sum(x['a'])+sum(x['b'])
return(y)
df['e'] = df.groupby(['c','d']).transform(f)
Gives me:
KeyError: ('a', 'occurred at index a')
Though I know that following does work:
df.groupby(['c','d']).apply(f)
What causes this behavior and how can I obtain something like this:
a b c d e
1 1 q z 12
2 2 q z 12
3 3 q z 12
4 4 q o 8
5 5 w o 22
6 6 w o 22
First lets see how to group by a single column in a Pandas DataFrame you can use the next syntax: df.groupby(['publication']) Copy. In order to group by multiple columns you need to use the next syntax: df.groupby(['publication', 'date_m']) Copy. The columns should be provided as a list to the groupby method.
Applying multiple aggregation functions to a groupby is done by method: agg. Note: that another function aggregate exists which and agg is an alias for it. The functions can be passed as a list. The available aggregation functions for group by in Pandas are:
Often you may want to group and aggregate by multiple columns of a pandas DataFrame. Fortunately this is easy to do using the pandas .groupby () and .agg () functions. This tutorial explains several examples of how to use these functions in practice.
One useful way to inspect a Pandas GroupBy object and see the splitting in action is to iterate over it. This is implemented in DataFrameGroupBy.__iter__ () and produces an iterator of (group, DataFrame) pairs for DataFrames:
for this particular case you could do:
g = df.groupby(['c', 'd'])
df['e'] = g.a.transform('sum') + g.b.transform('sum')
df
# outputs
a b c d e
0 1 1 q z 12
1 2 2 q z 12
2 3 3 q z 12
3 4 4 q o 8
4 5 5 w o 22
5 6 6 w o 22
if you can construct the final result by a linear combination of the independent transforms on the same groupby, this method would work.
otherwise, you'd use a groupby-apply
and then merge back to the original df.
example:
_ = df.groupby(['c','d']).apply(lambda x: sum(x.a+x.b)).rename('e').reset_index()
df.merge(_, on=['c','d'])
# same output as above.
You can use GroupBy
+ transform
with sum
twice:
df['e'] = df.groupby(['c', 'd'])[['a', 'b']].transform('sum').sum(1)
print(df)
a b c d e
0 1 1 q z 12
1 2 2 q z 12
2 3 3 q z 12
3 4 4 q o 8
4 5 5 w o 22
5 6 6 w o 22
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With