Is there a Python pandas function similar to R's dplyr::mutate(), which can add a new column to grouped data by applying a function to one of the columns of the grouped data? Below is a detailed explanation of the problem:
I generated sample data using this code:
x <- data.frame(country = rep(c("US", "UK"), 5), state = c(letters[1:10]), pop=sample(10000:50000,10))
Now I want to add a new column that holds the maximum population for each country (US and UK). I can do it using the following R code...
x <- group_by(x, country)
x <- mutate(x,max_pop = max(pop))
x <- arrange(x, country)
...or equivalently, using the R dplyr pipe operator:
x %>% group_by(country) %>% mutate(max_pop = max(pop)) %>% arrange(country)
So my question is: how do I do this in Python using pandas? I tried the following, but it did not work:
x['max_pop'] = x.groupby('country').pop.apply(max)
Dplython. The dplython package is a dplyr-style library for Python users. It provides a range of functions for data preprocessing.
Add new columns with mutate(). mutate() allows you to create new columns in the DataFrame. The new columns can be composed from existing columns. For example, let's create two new columns: one by dividing the distance column by 1000, and the other by concatenating the carrier and origin columns.
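For comparison, the same two derived columns can be sketched in plain pandas with DataFrame.assign, which plays a role similar to mutate(). This is not dplython code. The flights frame, its values, and the new column names distance_thousands and carrier_origin are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the text above
flights = pd.DataFrame({
    "carrier": ["AA", "UA"],
    "origin": ["JFK", "EWR"],
    "distance": [2475, 719],
})

# assign() returns a new DataFrame with the extra columns appended
flights = flights.assign(
    distance_thousands=flights["distance"] / 1000,           # distance divided by 1000
    carrier_origin=flights["carrier"] + flights["origin"],   # string concatenation
)

print(flights)
```

Like mutate(), assign() leaves the existing columns untouched and returns a new object, so it can be chained with other DataFrame methods.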
Both Pandas and dplyr can connect to virtually any data source, and read from any file format. That's why we won't spend any time exploring connection options but will use a built-in dataset instead. There's no winner in this Pandas vs. dplyr comparison, as both libraries are nearly identical in syntax.
One of the great things about the R world has been a collection of R packages called tidyverse that are easy for beginners to learn and provide a consistent data manipulation and visualisation space. The value of these tools has been so great that many of them have been ported to Python.
You want to use transform. transform returns an object with the same index as the object being grouped, which makes it easy to assign back as a new column when that object is a DataFrame.
x['max_pop'] = x.groupby('country').pop.transform('max')
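To see why the index matters, here is a small self-contained sketch that contrasts an aggregation with transform on the same sample data as the Setup section. It is likely, though not confirmed in the question, that the apply(max) attempt failed for the reason shown: the aggregated result is indexed by country, so it does not line up with the rows of x. The variable names per_group, per_row, misaligned and aligned are only for illustration, and bracket access ['pop'] is used because pop is also the name of a DataFrame method.

```python
import pandas as pd

x = pd.DataFrame({
    "country": ["US", "UK", "US", "UK"],
    "state": ["a", "b", "c", "d"],
    "pop": [37088, 46987, 17116, 20484],
})

grouped = x.groupby("country")["pop"]

# Aggregation: one value per group, indexed by the group key ('UK', 'US')
per_group = grouped.max()

# transform: one value per original row, indexed like x (0, 1, 2, 3)
per_row = grouped.transform("max")

# Column assignment aligns on index labels, so the aggregated Series
# finds no matching labels in x and produces only missing values
misaligned = x.assign(max_pop=per_group)
aligned = x.assign(max_pop=per_row)

print(misaligned)
print(aligned)
```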
Setup
import pandas as pd
x = pd.DataFrame(dict(
country=['US','UK','US','UK'],
state=['a','b','c','d'],
pop=[37088, 46987, 17116, 20484]
))
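Putting it together, the dplyr pipeline from the question (group_by, mutate, arrange) can be approximated as one pandas method chain. This is a sketch under the assumption that sort_values is an acceptable stand-in for arrange(). The name result and the choice of a stable sort and a reset index are mine, not part of the original answer.

```python
import pandas as pd

x = pd.DataFrame(dict(
    country=['US', 'UK', 'US', 'UK'],
    state=['a', 'b', 'c', 'd'],
    pop=[37088, 46987, 17116, 20484]
))

result = (
    x
    # mutate(max_pop = max(pop)) within each country
    .assign(max_pop=lambda df: df.groupby('country')['pop'].transform('max'))
    # arrange(country); a stable sort keeps the original order within each country
    .sort_values('country', kind='stable')
    .reset_index(drop=True)
)

print(result)
```

The lambda inside assign receives the DataFrame at that point in the chain, so the whole expression works without an intermediate variable, much like the %>% version in R.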