This is probably a duplicate, but I have spent too much time googling this without any luck. Assume I have a data frame:
import pandas as pd

data = {"letters": ["a", "a", "a", "b", "b", "b"],
        "boolean": [True, True, True, True, True, False],
        "numbers": [1, 2, 3, 1, 2, 3]}
df = pd.DataFrame(data)
df
I want to 1) group by letters, 2) take the mean of numbers if all values in boolean have the same value. In R I would write:
library(dplyr)

df %>%
  group_by(letters) %>%
  mutate(
    condition = n_distinct(boolean) == 1,
    numbers = ifelse(condition, mean(numbers), numbers)
  ) %>%
  select(-condition)
This would result in the following output:
# A tibble: 6 x 3
# Groups: letters [2]
letters boolean numbers
<chr> <lgl> <dbl>
1 a TRUE 2
2 a TRUE 2
3 a TRUE 2
4 b TRUE 1
5 b TRUE 2
6 b FALSE 3
How would you do it using Python pandas?
We can use a lazy groupby and transform:
g = df.groupby('letters')
# rows whose group is all-True in 'boolean' get the group mean; other rows keep their value
df.loc[g['boolean'].transform('all'), 'numbers'] = g['numbers'].transform('mean')
Output:
letters boolean numbers
0 a True 2
1 a True 2
2 a True 2
3 b True 1
4 b True 2
5 b False 3
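Note that transform('all') marks the groups where every boolean is True. If you want the question's literal condition (all values in boolean identical, whether True or False), a small variation with nunique should work. A minimal sketch on the example data (the names g and mask are just illustrative, and numbers is cast to float so the group means fit the column):

import pandas as pd

df = pd.DataFrame({"letters": ["a", "a", "a", "b", "b", "b"],
                   "boolean": [True, True, True, True, True, False],
                   "numbers": [1, 2, 3, 1, 2, 3]})

df['numbers'] = df['numbers'].astype(float)  # so the float group means can be assigned cleanly
g = df.groupby('letters')
# True for rows whose group holds only one distinct boolean value
mask = g['boolean'].transform('nunique') == 1
df.loc[mask, 'numbers'] = g['numbers'].transform('mean')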
Another way would be to use np.where: where a group has only one unique boolean value, take the mean; where it doesn't, keep the original numbers. Code below:
import numpy as np

df['numbers'] = np.where(df.groupby('letters')['boolean'].transform('nunique') == 1,
                         df.groupby('letters')['numbers'].transform('mean'),
                         df['numbers'])
letters boolean numbers
0 a True 2.0
1 a True 2.0
2 a True 2.0
3 b True 1.0
4 b True 2.0
5 b False 3.0
Alternatively, mask out the rows where the condition does not apply as you compute the mean:
m = df.groupby('letters')['boolean'].transform('nunique') == 1
df.loc[m, 'numbers'] = df[m].groupby('letters')['numbers'].transform('mean')
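As a quick sanity check on the example data, the mask-and-assign version and the np.where version end up with the same frame once the dtype is aligned. A minimal sketch (out_mask and out_where are just illustrative names):

import numpy as np
import pandas as pd

df = pd.DataFrame({"letters": ["a", "a", "a", "b", "b", "b"],
                   "boolean": [True, True, True, True, True, False],
                   "numbers": [1, 2, 3, 1, 2, 3]})

m = df.groupby('letters')['boolean'].transform('nunique') == 1

# mask-and-assign version (cast to float first so the means fit the column dtype)
out_mask = df.copy()
out_mask['numbers'] = out_mask['numbers'].astype(float)
out_mask.loc[m, 'numbers'] = out_mask[m].groupby('letters')['numbers'].transform('mean')

# np.where version
out_where = df.copy()
out_where['numbers'] = np.where(m, df.groupby('letters')['numbers'].transform('mean'), df['numbers'])

pd.testing.assert_frame_equal(out_mask, out_where)  # no error: identical results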
Since you are comparing directly to R, I would prefer to use siuba rather than pandas:
from siuba import mutate, if_else, _, select, group_by, ungroup

df1 = df >> \
    group_by(_.letters) >> \
    mutate(condition = _.boolean.unique().size == 1,
           numbers = if_else(_.condition, _.numbers.mean(), _.numbers)) >> \
    ungroup() >> select(-_.condition)
print(df1)
letters boolean numbers
0 a True 2.0
1 a True 2.0
2 a True 2.0
3 b True 1.0
4 b True 2.0
5 b False 3.0
Note that >> is the pipe, and I added \ in order to continue on the next line. Also note that to refer to the variables you use _.variable.
It seems your R code has an issue: in R, you should rather use condition = all(boolean) instead of the code you have. That translates to condition = boolean.all() in Python.
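For completeness, a minimal pandas sketch of that all(boolean) reading (it mirrors the transform('all') answer above; g and mask are just illustrative names):

import pandas as pd

df = pd.DataFrame({"letters": ["a", "a", "a", "b", "b", "b"],
                   "boolean": [True, True, True, True, True, False],
                   "numbers": [1, 2, 3, 1, 2, 3]})

g = df.groupby('letters')
# condition per group: every value in 'boolean' is True, i.e. all(boolean)
mask = g['boolean'].transform('all')
# keep the original numbers where the condition fails, otherwise use the group mean
df['numbers'] = df['numbers'].where(~mask, g['numbers'].transform('mean'))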