In Python Pandas, I want to add columns by executing multiple aggregate functions on multiple columns like R dplyr mutate_each. For example, Can Python Pandas realize the same processing as the following R script?
R dplyr :
iris %>%
group_by(Species) %>%
mutate_each(funs(min, max, mean), starts_with("Sepal"))
However, I was able to achieve the same processing as mutate with Pandas. As shown in the code below, I could execute one aggregate function and add one column.
R dplyr :
iris %>% group_by(Species) %>% mutate(MaxSepalLen = max(Sepal.Length))
Python Pandas :
iris.assign(MaxSepalLen = iris.groupby("Species")["Sepal.Length"].transform('max'))
With Pandas, this can be accomplished in a more lenghty way.
First, let's prepare the data:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
iris_data = load_iris()
iris = pd.DataFrame(iris_data.data, columns = [c[0:3] + c[6] for c in iris_data.feature_names])
iris['Species'] = iris_data.target_names[iris_data.target]
Now we can imitate the mutate_each
pipeline:
# calculate the aggregates
pivot = iris.groupby("Species")[iris.columns[iris.columns.str.startswith('sepal')]
].aggregate(['min', 'max', np.mean])
# name the aggregates
pivot.columns = pivot.columns.get_level_values(0) + pivot.columns.get_level_values(1)
# merge aggregates with the original dataframe
new_iris = iris.merge(pivot, left_on='Species', right_index=True)
The pivot
table is really a small pivot table:
seplmin seplmax seplmean sepwmin sepwmax sepwmean
Species
setosa 4.3 5.8 5.006 2.3 4.4 3.418
versicolor 4.9 7.0 5.936 2.0 3.4 2.770
virginica 4.9 7.9 6.588 2.2 3.8 2.974
And the new_iris
is a 150x11 table with all columns from iris
and pivot
combined, identical to what dplyr
outputs.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With