How I can implement the case_when function of R in a python code?
Here is the case_when function of R:
https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/case_when
as a minimum working example suppose we have the following dataframe (python code follows):
import pandas as pd
import numpy as np
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore'])
df
Suppose than we want to create an new column called 'elderly' that looks at the 'age' column and does the following:
if age < 10 then baby
if age >= 10 and age < 20 then kid
if age >=20 and age < 30 then young
if age >= 30 and age < 50 then mature
if age >= 50 then grandpa
Can someone help on this ?
You want to use np.select
:
conditions = [
(df["age"].lt(10)),
(df["age"].ge(10) & df["age"].lt(20)),
(df["age"].ge(20) & df["age"].lt(30)),
(df["age"].ge(30) & df["age"].lt(50)),
(df["age"].ge(50)),
]
choices = ["baby", "kid", "young", "mature", "grandpa"]
df["elderly"] = np.select(conditions, choices)
# Results in:
# name age preTestScore postTestScore elderly
# 0 Jason 42 4 25 mature
# 1 Molly 52 24 94 grandpa
# 2 Tina 36 31 57 mature
# 3 Jake 24 2 62 young
# 4 Amy 73 3 70 grandpa
The conditions
and choices
lists must be the same length.
There is also a default
parameter that is used when all conditions
evaluate to False
.
np.select
is great because it's a general way to assign values to elements in choicelist depending on conditions.
However, for the particular problem OP tries to solve, there is a succinct way to achieve the same with the pandas' cut
method.
bin_cond = [-np.inf, 10, 20, 30, 50, np.inf] # think of them as bin edges
bin_lab = ["baby", "kid", "young", "mature", "grandpa"] # the length needs to be len(bin_cond) - 1
df["elderly2"] = pd.cut(df["age"], bins=bin_cond, labels=bin_lab)
# name age preTestScore postTestScore elderly elderly2
# 0 Jason 42 4 25 mature mature
# 1 Molly 52 24 94 grandpa grandpa
# 2 Tina 36 31 57 mature mature
# 3 Jake 24 2 62 young young
# 4 Amy 73 3 70 grandpa grandpa
pyjanitor has a case_when implementation in dev
that could be helpful in this case, the implementation idea is inspired by if_else in pydatatable
and fcase in R's data.table
; under the hood, it uses pd.Series.mask:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn
df.case_when(
df.age.lt(10), 'baby', # 1st condition, result
df.age.between(10, 20, 'left'), 'kid', # 2nd condition, result
df.age.between(20, 30, 'left'), 'young', # 3rd condition, result
df.age.between(30, 50, 'left'), 'mature', # 4th condition, result
'grandpa', # default if none of the conditions match
column_name = 'elderly') # column name to assign to
name age preTestScore postTestScore elderly
0 Jason 42 4 25 mature
1 Molly 52 24 94 grandpa
2 Tina 36 31 57 mature
3 Jake 24 2 62 young
4 Amy 73 3 70 grandpa
Alby's solution is more efficient for this use case than an if/else construct.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With