Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating a column based on multiple conditions

I'm a longtime SAS user trying to get into Pandas. I'd like to set a column's value based on a variety of if conditions. I think I can do it using nested np.where commands but thought I'd check if there's a more elegant solution. For instance, if I set a left bound and right bound, and want to return a column of string values for if x is left, middle, or right of these boundaries, what is the best way to do it? Basically if x < lbound return "left", else if lbound < x < rbound return "middle", else if x > rbound return "right".

df
   lbound   rbound  x
0   -1      1       0
1   5       7       1
2   0       1       2

Can check for one condition by using np.where:

df['area'] = np.where(df['x']>df['rbound'],'right','somewhere else')

But not sure what to do it I want to check multiple if-else ifs in a single line.

Output should be:

df
   lbound   rbound  x    area
0   -1      1       0    middle
1   5       7       1    left
2   0       1       2    right
like image 608
Nathan Przybylo Avatar asked Mar 09 '18 01:03

Nathan Przybylo


People also ask

How do you create a new column based on multiple conditions PySpark?

Add Multiple Columns using Map You can add multiple columns to PySpark DataFrame in several ways if you wanted to add a known set of columns you can easily do it by chaining withColumn() or using select() .

How do you use multiple conditions in a data frame?

Using Loc to Filter With Multiple ConditionsAdd each condition you want to be included in the filtered result and concatenate them with the & operator. You'll see our code sample will return a pd. dataframe of our filtered rows. Don't forget to include "import pandas as pd" at the top!


2 Answers

Option 1

You can use nested np.where statements. For example:

df['area'] = np.where(df['x'] > df['rbound'], 'right', 
                      np.where(df['x'] < df['lbound'],
                               'left', 'somewhere else'))

Option 2

You can use .loc accessor to assign specific ranges. Note you will have to add the new column before use. We take this opportunity to set the default, which may be overwritten later.

df['area'] = 'somewhere else'
df.loc[df['x'] > df['rbound'], 'area'] = 'right'
df.loc[df['x'] < df['lbound'], 'area'] = 'left'

Explanation

These are both valid alternatives with comparable performance. The calculations are vectorised in both instances. My preference is for Option 2 as it seems more readable. If there are a large number of nested criteria, np.where may be more convenient.

like image 192
jpp Avatar answered Sep 28 '22 22:09

jpp


You can use numpy select instead of np.where

cond = [df['x'].between(df['lbound'], df['rbound']), (df['x'] < df['lbound']) , df['x'] > df['rbound'] ]
output = [ 'middle', 'left', 'right']

df['area'] = np.select(cond, output, default=np.nan)



    lbound  rbound  x   area
0   -1      1       0   middle
1   5       7       1   left
2   0       1       2   right
like image 24
Vaishali Avatar answered Sep 28 '22 21:09

Vaishali