I'm trying to apply a custom function in pandas similar to the groupby and mutate functionality in dplyr.
What I'm trying to do is, given a pandas DataFrame like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'category1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'category2': ['a', 'b', 'a', 'b', 'a', 'b'],
                   'var1': np.random.randint(0, 100, 6),
                   'var2': np.random.randint(0, 100, 6)})
df
  category1 category2  var1  var2
0         a         a    23    59
1         a         b    54    20
2         a         a    48    62
3         b         b    45    76
4         b         a    60    26
5         b         b    13    70
apply some function that returns as many elements as there are rows in each group:
def myfunc(s):
    # Repeat the group mean once per element so the output length matches the group
    return [np.mean(s)] * len(s)
to get this result:
df
  category1 category2  var1  var2  var3
0         a         a    23    59  35.5
1         a         b    54    20    54
2         a         a    48    62  35.5
3         b         b    45    76    29
4         b         a    60    26    60
5         b         b    13    70    29
I was thinking of something along the lines of:
df['var3'] = df.groupby(['category1', 'category2'], group_keys=False).apply(lambda x: myfunc(x.var1))
but haven't been able to get the index to match.
In R with dplyr this would be
df <- df %>%
  group_by(category1, category2) %>%
  mutate(
    var3 = myfunc(var1)
  )
So I was able to solve it by using a custom function like:
def myfunc_data(data):
    # Add the per-group result as a new column and return the whole group
    data['var3'] = myfunc(data.var1)
    return data
and
df = df.groupby(['category1', 'category2']).apply(myfunc_data)
but I guess I was still wondering if there's a way to do it without defining this custom function.
Simply call the apply method on the groupby object so that the function runs on each group's DataFrame. This is the most straightforward way and the easiest to understand. Notice that the function takes a DataFrame as its only argument, so any code within the custom function needs to work on a pandas DataFrame.
GroupBy.apply applies a function func group-wise and combines the results. The function passed to apply must take a DataFrame as its first argument and return a DataFrame, a Series or a scalar. apply then takes care of combining the results into a single DataFrame or Series.
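As a rough illustration of those return types, the sketch below uses a small made-up frame (the names toy, k and v are hypothetical, not from the question). Returning a scalar gives one value per group, while returning a Series that keeps the group's own index gives a result that can be assigned back to the frame, because column assignment aligns on the index.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({'k': ['a', 'a', 'b'], 'v': [1, 3, 10]})

# Scalar per group: the result is a Series indexed by the group key
spread = toy.groupby('k')['v'].apply(lambda s: s.max() - s.min())

# Series per group that keeps each group's original row index;
# group_keys=False avoids adding the key as an extra index level
centered = toy.groupby('k', group_keys=False)['v'].apply(lambda s: s - s.mean())

# Assignment aligns on the index, so the rows line up with the original frame
toy['centered'] = centered

print(spread)
print(toy)
```

This index alignment is what makes the second form usable as a new column, which is the part the question was struggling with.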
There are generally three ways to apply custom functions in pandas: map, apply, and applymap. map works element-wise on a Series (e.g. one column of a DataFrame). apply works along an axis of a DataFrame, passing each column or row to the function. applymap works element-wise on every value of a DataFrame.
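A minimal sketch of the three styles on a made-up frame (toy, x and y are hypothetical names). As far as I know, recent pandas releases renamed DataFrame.applymap to DataFrame.map, so the sketch picks whichever of the two exists in the installed version; treat that fallback as an assumption about your environment.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

# Series.map: element-wise on a single column
doubled = toy['x'].map(lambda value: value * 2)

# DataFrame.apply: the function receives each column as a Series (axis=0 by default)
col_range = toy.apply(lambda col: col.max() - col.min())

# Element-wise on the whole frame: use DataFrame.map if present, else applymap
elementwise = toy.map if hasattr(toy, 'map') else toy.applymap
plus_one = elementwise(lambda value: value + 1)

print(doubled.tolist())
print(col_range.to_dict())
print(plus_one)
```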
Pandas groupby is used to group data by category and apply a function to each group. It also helps to aggregate data efficiently. The DataFrame.groupby() function splits the data into groups based on some criteria.
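To make the split and aggregate steps concrete, here is a small sketch with invented values (the frame toy and its numbers are hypothetical; only the column names are borrowed from the question). Note that aggregation returns one row per group, unlike the same-length result the question asks for.

```python
import pandas as pd

# Hypothetical values; column names borrowed from the question
toy = pd.DataFrame({'category1': ['a', 'a', 'b', 'b'],
                    'var1': [10, 20, 30, 50]})

# Split by category1, then aggregate var1 within each group
summary = toy.groupby('category1')['var1'].agg(['mean', 'sum'])

print(summary)
```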
Use GroupBy.transform, which returns a Series the same size as the original DataFrame, so the result can be assigned to a new column:
np.random.seed(123)
df = pd.DataFrame({'category1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'category2': ['a', 'b', 'a', 'b', 'a', 'b'],
                   'var1': np.random.randint(0, 100, 6),
                   'var2': np.random.randint(0, 100, 6)})

df['var3'] = df.groupby(['category1', 'category2'])['var1'].transform(myfunc)
print(df)
  category1 category2  var1  var2  var3
0         a         a    66    86    82
1         a         b    92    97    92
2         a         a    98    96    82
3         b         b    17    47    37
4         b         a    83    73    83
5         b         b    57    32    37
Alternative with a lambda function:
df['var3'] = (df.groupby(['category1', 'category2'])['var1']
                .transform(lambda s: [np.mean(s)] * len(s)))
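Since myfunc only repeats the group mean, the same column can probably be produced by passing the built-in string alias 'mean' to transform, which skips the Python-level function entirely. A minimal sketch, using the fixed var1 values from the question's first table instead of random data; this only covers the mean case, not arbitrary custom functions:

```python
import pandas as pd

# var1 values copied from the question's sample table
df = pd.DataFrame({'category1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'category2': ['a', 'b', 'a', 'b', 'a', 'b'],
                   'var1': [23, 54, 48, 45, 60, 13]})

# transform('mean') broadcasts each group's mean back to that group's rows
df['var3'] = df.groupby(['category1', 'category2'])['var1'].transform('mean')

print(df)
```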