
Python pandas equivalent to R's group_by, mutate, and ifelse

Tags: python, pandas, r

This is probably a duplicate, but I have spent too much time googling without any luck. Assume I have a data frame:

import pandas as pd
data = {"letters": ["a", "a", "a", "b", "b", "b"],
        "boolean": [True, True, True, True, True, False],
        "numbers": [1, 2, 3, 1, 2, 3]}
df = pd.DataFrame(data)
df

I want to 1) group by letters and 2) replace numbers with the group mean if all values of boolean within the group are the same, leaving numbers unchanged otherwise. In R I would write:

library(dplyr)
df %>% 
  group_by(letters) %>%
  mutate(
    condition = n_distinct(boolean) == 1,
    numbers = ifelse(condition, mean(numbers), numbers)
  ) %>% 
  select(-condition)

This would result in the following output:

# A tibble: 6 x 3
# Groups:   letters [2]
  letters boolean numbers
  <chr>   <lgl>     <dbl>
1 a       TRUE          2
2 a       TRUE          2
3 a       TRUE          2
4 b       TRUE          1
5 b       TRUE          2
6 b       FALSE         3

How would you do it using Python pandas?

Kjetil asked Dec 14 '21 21:12


3 Answers

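We can build the lazy groupby object once and reuse it with transform. Note that transform('all') is True only for groups where every boolean value is True: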

g = df.groupby('letters')

df.loc[g['boolean'].transform('all'), 'numbers'] = g['numbers'].transform('mean')

Output:

  letters  boolean  numbers
0       a     True        2
1       a     True        2
2       a     True        2
3       b     True        1
4       b     True        2
5       b    False        3
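To see why this works, it can help to look at the two intermediate Series. The sketch below rebuilds the question's frame and checks that transform returns one value per original row, aligned to the original index, which is what lets it serve both as a boolean mask and as the right-hand side of the assignment. The names mask, group_mean and result are only illustrative, and Series.where is used here in place of .loc so that df itself is left untouched.

```python
import pandas as pd

df = pd.DataFrame({
    "letters": ["a", "a", "a", "b", "b", "b"],
    "boolean": [True, True, True, True, True, False],
    "numbers": [1, 2, 3, 1, 2, 3],
})

g = df.groupby("letters")

# One value per original row: True where every `boolean` in the row's group is True.
mask = g["boolean"].transform("all")

# One value per original row: the mean of `numbers` within the row's group.
group_mean = g["numbers"].transform("mean")

# Series.where keeps values where the condition holds and takes the
# replacement elsewhere, so rows with mask == True receive the group mean.
result = df["numbers"].where(~mask, group_mean)
```

Because both Series share df's index, no merge or join back onto the original frame is needed.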
Quang Hoang answered Oct 22 '22 06:10


Another way would be to use np.where: where a group has exactly one unique boolean value, take the group mean; where it doesn't, keep the original numbers. Code below:

import numpy as np

# Build the groupby once and reuse it for both transforms.
g = df.groupby('letters')
df['numbers'] = np.where(g['boolean'].transform('nunique') == 1,
                         g['numbers'].transform('mean'),
                         df['numbers'])



Output:

  letters  boolean  numbers
0       a     True      2.0
1       a     True      2.0
2       a     True      2.0
3       b     True      1.0
4       b     True      2.0
5       b    False      3.0

Alternatively, build a boolean mask for the rows whose group has a single unique boolean value, and compute and assign the mean only for those rows:

m = df.groupby('letters')['boolean'].transform('nunique') == 1

df.loc[m, 'numbers'] = df[m].groupby('letters')['numbers'].transform('mean')
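If you would rather not modify df in place, the same logic can be written as a function that returns a new frame, which is somewhat closer in shape to the dplyr pipeline. This is only a sketch: the helper name mean_if_uniform and its parameters are made up for illustration, and Series.mask is swapped in for the .loc assignment.

```python
import pandas as pd

def mean_if_uniform(frame, group_col, flag_col, value_col):
    """Hypothetical helper: return a copy of `frame` where `value_col` is
    replaced by its group mean in groups whose `flag_col` has one unique value."""
    g = frame.groupby(group_col)
    uniform = g[flag_col].transform("nunique") == 1
    means = g[value_col].transform("mean")
    # Series.mask replaces values where the condition is True.
    return frame.assign(**{value_col: frame[value_col].mask(uniform, means)})

df = pd.DataFrame({
    "letters": ["a", "a", "a", "b", "b", "b"],
    "boolean": [True, True, True, True, True, False],
    "numbers": [1, 2, 3, 1, 2, 3],
})

out = mean_if_uniform(df, "letters", "boolean", "numbers")
```

Here out holds the updated numbers column while df keeps its original values.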
wwnde answered Oct 22 '22 08:10


Since you are comparing directly to R, I would prefer to use siuba, which provides dplyr-style verbs for pandas data frames, rather than plain pandas:

from siuba import mutate, if_else, _, select, group_by, ungroup

df1 = df >>\
    group_by(_.letters) >> \
    mutate( condition = _.boolean.unique().size == 1, 
            numbers = if_else(_.condition, _.numbers.mean(), _.numbers)
          ) >>\
    ungroup() >> select(-_.condition)

print(df1)

Output:

  letters  boolean  numbers
0       a     True      2.0
1       a     True      2.0
2       a     True      2.0
3       b     True      1.0
4       b     True      2.0
5       b    False      3.0

Note that >> is the pipe operator. I added \ to continue the expression on the next line. Also note that you refer to columns as _.variable.

EDIT

It seems your R code has an issue. In R, you should rather use condition = all(boolean) instead of n_distinct(boolean) == 1. That translates to condition = boolean.all() in Python.
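The two conditions are not interchangeable, so which one is right depends on what you intend. The plain pandas sketch below uses a small made-up frame (the all-False group "c" is not from the question) to show where they differ: all() requires every value to be True, while a unique count of 1 also accepts a group that is uniformly False.

```python
import pandas as pd

# Hypothetical data: group "c" is uniformly False.
demo = pd.DataFrame({
    "letters": ["a", "a", "c", "c"],
    "boolean": [True, True, False, False],
})

g = demo.groupby("letters")["boolean"]

# Analogous to all(boolean) in R: True only if every value in the group is True.
all_true = g.transform("all")

# Analogous to n_distinct(boolean) == 1 in R: True if the group has a single value.
one_value = g.transform("nunique") == 1
```

For the data in the question both conditions give the same result, because no group there is entirely False.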

KU99 answered Oct 22 '22 06:10