
Substitute for mutate (dplyr package) in python pandas

Is there a Python pandas function similar to R's dplyr::mutate(), which can add a new column to grouped data by applying a function on one of the columns of the grouped data? Below is the detailed explanation of the problem:

I generated sample data using this code:

x <- data.frame(country = rep(c("US", "UK"), 5), state = c(letters[1:10]), pop=sample(10000:50000,10))

Now I want to add a new column holding the maximum population for each country (US and UK). I can do it using the following R code...

x <- group_by(x, country)
x <- mutate(x,max_pop = max(pop))
x <- arrange(x, country)

...or equivalently, using the R dplyr pipe operator:

x %>% group_by(country) %>% mutate(max_pop = max(pop)) %>% arrange(country)

So my question is: how do I do this in Python using pandas? I tried the following, but it did not work:

x['max_pop'] = x.groupby('country').pop.apply(max)
asked Dec 14 '16 at 16:12 by saurav shekhar




1 Answer

You want to use transform. It returns an object with the same index as the object being grouped, which makes it easy to assign the result back as a new column when that object is a DataFrame.

x['max_pop'] = x.groupby('country').pop.transform('max')
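To make the index point concrete, here is a minimal sketch contrasting an aggregation with transform on the same four-row sample used in this answer. The names grouped, per_group and per_row are only illustrative, and the bracket form ['pop'] is my own choice here because pop is also the name of a DataFrame method. The aggregated result is indexed by country, which is most likely why the attempt in the question did not line up with the rows of x.

```python
import pandas as pd

x = pd.DataFrame({
    "country": ["US", "UK", "US", "UK"],
    "state": ["a", "b", "c", "d"],
    "pop": [37088, 46987, 17116, 20484],
})

# Select the column with brackets; "pop" is also a DataFrame method name.
grouped = x.groupby("country")["pop"]

# Aggregation: one value per group, indexed by the group key (country).
per_group = grouped.max()

# transform: one value per original row, indexed like x itself.
per_row = grouped.transform("max")

# Because per_row shares x's index, it can be assigned directly.
x["max_pop"] = per_row
```

Assigning per_group instead would align on the country labels rather than on x's row index, so the new column would not receive the group maxima.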

Setup

import pandas as pd 

x = pd.DataFrame(dict(
    country=['US','UK','US','UK'],
    state=['a','b','c','d'],
    pop=[37088, 46987, 17116, 20484]
))
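For readers who want something closer to the whole dplyr pipeline from the question (group_by, mutate, arrange), the following is a rough sketch rather than an exact equivalent. It assumes the setup data above, uses assign as a stand-in for mutate and sort_values as a stand-in for arrange, and the name result is only illustrative. The stable sort and the index reset are my own choices, not something the R code specifies.

```python
import pandas as pd

x = pd.DataFrame({
    "country": ["US", "UK", "US", "UK"],
    "state": ["a", "b", "c", "d"],
    "pop": [37088, 46987, 17116, 20484],
})

result = (
    # mutate(max_pop = max(pop)) within each country
    x.assign(max_pop=x.groupby("country")["pop"].transform("max"))
    # arrange(country); a stable sort keeps the original order within a country
    .sort_values("country", kind="stable")
    # optional: renumber the rows after sorting
    .reset_index(drop=True)
)

print(result)
```

Unlike the in-place assignment, assign returns a new DataFrame and leaves x unchanged, which keeps the chain closer in spirit to the %>% version.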
answered Oct 14 '22 at 23:10 by piRSquared