I'm a longtime SAS user trying to get into pandas. I'd like to set a column's value based on a variety of if conditions. I think I can do it using nested np.where calls, but I thought I'd check whether there's a more elegant solution. For instance, given a left bound and a right bound, what is the best way to return a column of strings indicating whether x is left of, between, or right of those boundaries? Basically: if x < lbound return "left", else if lbound < x < rbound return "middle", else if x > rbound return "right".
df
   lbound  rbound  x
0      -1       1  0
1       5       7  1
2       0       1  2
I can check for a single condition using np.where:
df['area'] = np.where(df['x']>df['rbound'],'right','somewhere else')
But I'm not sure what to do if I want to check multiple if/else-if conditions in a single statement.
The output should be:
df
   lbound  rbound  x    area
0      -1       1  0  middle
1       5       7  1    left
2       0       1  2   right
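For reference, a minimal setup that reproduces this frame (assuming the standard pandas/NumPy imports used in the answers below):

# build the sample DataFrame from the question
import numpy as np
import pandas as pd

df = pd.DataFrame({'lbound': [-1, 5, 0],
                   'rbound': [1, 7, 1],
                   'x': [0, 1, 2]})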
Option 1
You can use nested np.where statements. For example:

df['area'] = np.where(df['x'] > df['rbound'], 'right',
                      np.where(df['x'] < df['lbound'], 'left', 'somewhere else'))
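The inner fallback here is 'somewhere else'; to match the expected output exactly, one option is to make the fallback 'middle', since any value that is neither right of rbound nor left of lbound must lie between the bounds:

# any row failing both checks falls through to 'middle'
df['area'] = np.where(df['x'] > df['rbound'], 'right',
                      np.where(df['x'] < df['lbound'], 'left', 'middle'))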
Option 2
You can use the .loc accessor to assign specific ranges. Note that you will have to create the new column before use; we take this opportunity to set the default, which may be overwritten later.
df['area'] = 'somewhere else'
df.loc[df['x'] > df['rbound'], 'area'] = 'right'
df.loc[df['x'] < df['lbound'], 'area'] = 'left'
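If you prefer to assign 'middle' explicitly rather than leaving it as the default, one more .loc assignment does it. This sketch assumes values on the bounds count as 'middle', since Series.between is inclusive by default:

# explicitly label rows where lbound <= x <= rbound
df.loc[df['x'].between(df['lbound'], df['rbound']), 'area'] = 'middle'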
Explanation
These are both valid alternatives with comparable performance. The calculations are vectorised in both instances. My preference is for Option 2 as it seems more readable. If there are a large number of nested criteria, np.where may be more convenient.
You can use np.select instead of nested np.where:

cond = [df['x'].between(df['lbound'], df['rbound']),
        df['x'] < df['lbound'],
        df['x'] > df['rbound']]
output = ['middle', 'left', 'right']
df['area'] = np.select(cond, output, default=np.nan)
   lbound  rbound  x    area
0      -1       1  0  middle
1       5       7  1    left
2       0       1  2   right
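One caveat: because default=np.nan mixes a float into a column of strings, the result is an object-dtype column containing NaN for any row that matches none of the conditions. If you want a string everywhere, a string default is a straightforward alternative:

# rows matching no condition get a string label instead of NaN
df['area'] = np.select(cond, output, default='somewhere else')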