case_when function from R to Python

How can I implement R's case_when function in Python?

Here is the case_when function of R:

https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/case_when

As a minimal working example, suppose we have the following DataFrame (Python code follows):

import pandas as pd
import numpy as np

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'age': [42, 52, 36, 24, 73], 
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore'])
df

Suppose then that we want to create a new column called 'elderly' that looks at the 'age' column and does the following:

if age < 10 then baby
if age >= 10 and age < 20 then kid
if age >= 20 and age < 30 then young
if age >= 30 and age < 50 then mature
if age >= 50 then grandpa

Can someone help with this?

msh855 asked Feb 12 '19 15:02


3 Answers

You want to use np.select:

conditions = [
    (df["age"].lt(10)),
    (df["age"].ge(10) & df["age"].lt(20)),
    (df["age"].ge(20) & df["age"].lt(30)),
    (df["age"].ge(30) & df["age"].lt(50)),
    (df["age"].ge(50)),
]
choices = ["baby", "kid", "young", "mature", "grandpa"]

df["elderly"] = np.select(conditions, choices)

# Results in:
#      name  age  preTestScore  postTestScore  elderly
#  0  Jason   42             4             25   mature
#  1  Molly   52            24             94  grandpa
#  2   Tina   36            31             57   mature
#  3   Jake   24             2             62    young
#  4    Amy   73             3             70  grandpa

The conditions and choices lists must be the same length.
There is also a default parameter that is used when all conditions evaluate to False.
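
To make the default parameter concrete, here is a minimal sketch on a small made-up Series (the ages and the "other" label are assumptions for illustration, not part of the question). When default is not passed, np.select falls back to 0, an integer, and some newer NumPy releases may refuse to combine that with string choices, so passing a string default explicitly is the safer habit.

```python
import numpy as np
import pandas as pd

# Made-up ages; NaN matches none of the conditions below.
ages = pd.Series([5, 15, 42, np.nan], name="age")

conditions = [
    ages.lt(10),
    ages.ge(10) & ages.lt(20),
]
choices = ["baby", "kid"]  # same length as conditions

# Rows where every condition is False receive the default value.
labels = np.select(conditions, choices, default="other")
print(labels.tolist())  # ['baby', 'kid', 'other', 'other']
```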

Alex answered Nov 07 '22 21:11


np.select is great because it's a general way to pick a value from choicelist for each element, depending on which condition holds.

However, for the particular problem the OP is trying to solve, pandas' cut function offers a more succinct way to achieve the same result.


bin_cond = [-np.inf, 10, 20, 30, 50, np.inf]            # think of them as bin edges
bin_lab = ["baby", "kid", "young", "mature", "grandpa"] # the length needs to be len(bin_cond) - 1
# right=False makes each bin left-closed, e.g. [10, 20), matching "age >= 10 and age < 20"
df["elderly2"] = pd.cut(df["age"], bins=bin_cond, labels=bin_lab, right=False)

#     name  age  preTestScore  postTestScore  elderly elderly2
# 0  Jason   42             4             25   mature   mature
# 1  Molly   52            24             94  grandpa  grandpa
# 2   Tina   36            31             57   mature   mature
# 3   Jake   24             2             62    young    young
# 4    Amy   73             3             70  grandpa  grandpa
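
One detail worth checking with pd.cut is what happens to ages that sit exactly on a bin edge, since the right parameter decides which side of each bin is closed. The sketch below uses made-up edge values (10, 20, 30, 50) rather than the question's data, and compares the default right=True with right=False; the question's rules (age >= 10 and age < 20 is a kid) correspond to the left-closed variant.

```python
import numpy as np
import pandas as pd

# Made-up ages that fall exactly on the bin edges.
edge_ages = pd.Series([10, 20, 30, 50])
bins = [-np.inf, 10, 20, 30, 50, np.inf]
labels = ["baby", "kid", "young", "mature", "grandpa"]

# Default right=True: bins look like (10, 20], so 10 lands in the lower bin.
right_closed = pd.cut(edge_ages, bins=bins, labels=labels)

# right=False: bins look like [10, 20), so 10 lands in the upper bin.
left_closed = pd.cut(edge_ages, bins=bins, labels=labels, right=False)

print(right_closed.astype(str).tolist())  # ['baby', 'kid', 'young', 'mature']
print(left_closed.astype(str).tolist())   # ['kid', 'young', 'mature', 'grandpa']
```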

Alby answered Nov 07 '22 21:11


pyjanitor has a case_when implementation in dev that could be helpful here. The idea is inspired by if_else in pydatatable and fcase in R's data.table; under the hood, it uses pd.Series.mask:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn

df.case_when(
    df.age.lt(10), 'baby',                     # 1st condition, result
    df.age.between(10, 20, 'left'), 'kid',     # 2nd condition, result
    df.age.between(20, 30, 'left'), 'young',   # 3rd condition, result
    df.age.between(30, 50, 'left'), 'mature',  # 4th condition, result
    'grandpa',                                 # default if none of the conditions match
    column_name = 'elderly')                   # column name to assign to
 
    name  age  preTestScore  postTestScore  elderly
0  Jason   42             4             25   mature
1  Molly   52            24             94  grandpa
2   Tina   36            31             57   mature
3   Jake   24             2             62    young
4    Amy   73             3             70  grandpa

Alby's solution is more efficient for this use case than an if/else construct.
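
For anyone curious how a case_when could be built on pd.Series.mask, here is a rough sketch. It is not pyjanitor's actual code: case_when_sketch is a hypothetical helper written for illustration, and the extra ages 10 and 5 are made up to exercise the lower categories. Conditions are applied in reverse order so that the first matching condition wins, which is the behaviour case_when has in R.

```python
import pandas as pd

# The question's ages plus two made-up values (10 and 5).
df = pd.DataFrame({"age": [42, 52, 36, 24, 73, 10, 5]})

def case_when_sketch(pairs, default, index):
    """Hypothetical helper: return a Series where the first True condition wins."""
    result = pd.Series(default, index=index, dtype=object)
    # Apply later conditions first so earlier ones overwrite them.
    for condition, value in reversed(pairs):
        result = result.mask(condition, value)
    return result

# Because the first match wins, each condition only needs an upper bound.
pairs = [
    (df["age"].lt(10), "baby"),
    (df["age"].lt(20), "kid"),
    (df["age"].lt(30), "young"),
    (df["age"].lt(50), "mature"),
]
df["elderly"] = case_when_sketch(pairs, default="grandpa", index=df.index)
print(df)
```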

sammywemmy answered Nov 07 '22 21:11