I'm a longtime SAS user trying to get into pandas. I'd like to set a column's value based on a variety of if conditions. I think I can do it using nested np.where calls, but I thought I'd check whether there's a more elegant solution. For instance, given a left bound and a right bound, what is the best way to return a column of strings indicating whether x is left of, between, or right of those boundaries? Basically: if x < lbound return "left", else if lbound < x < rbound return "middle", else if x > rbound return "right".
df
   lbound  rbound  x
0      -1       1  0
1       5       7  1
2       0       1  2
I can check for a single condition using np.where:
df['area'] = np.where(df['x']>df['rbound'],'right','somewhere else')
But I'm not sure what to do if I want to check multiple if/else-if conditions in a single statement.
The output should be:
df
   lbound  rbound  x    area
0      -1       1  0  middle
1       5       7  1    left
2       0       1  2   right
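For reference, a minimal setup that reproduces this frame (assuming the standard pandas/NumPy imports used in the answers below):

# build the sample DataFrame from the question
import numpy as np
import pandas as pd

df = pd.DataFrame({'lbound': [-1, 5, 0],
                   'rbound': [1, 7, 1],
                   'x': [0, 1, 2]})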
Option 1
You can use nested np.where statements. For example:

df['area'] = np.where(df['x'] > df['rbound'], 'right',
                      np.where(df['x'] < df['lbound'], 'left', 'somewhere else'))
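The inner fallback here is 'somewhere else'; to match the expected output exactly, one option is to make the fallback 'middle', since any value that is neither right of rbound nor left of lbound must lie between the bounds:

# any row failing both checks falls through to 'middle'
df['area'] = np.where(df['x'] > df['rbound'], 'right',
                      np.where(df['x'] < df['lbound'], 'left', 'middle'))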
Option 2
You can use the .loc accessor to assign specific ranges. Note that you will have to create the new column before use; we take this opportunity to set the default, which may be overwritten later.
df['area'] = 'somewhere else'
df.loc[df['x'] > df['rbound'], 'area'] = 'right'
df.loc[df['x'] < df['lbound'], 'area'] = 'left'
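If you prefer to assign 'middle' explicitly rather than leaving it as the default, one more .loc assignment does it. This sketch assumes values on the bounds count as 'middle', since Series.between is inclusive by default:

# explicitly label rows where lbound <= x <= rbound
df.loc[df['x'].between(df['lbound'], df['rbound']), 'area'] = 'middle'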
Explanation
These are both valid alternatives with comparable performance. The calculations are vectorised in both instances. My preference is for Option 2 as it seems more readable. If there are a large number of nested criteria, np.where may be more convenient.
You can use np.select instead of nested np.where:

cond = [df['x'].between(df['lbound'], df['rbound']),
        df['x'] < df['lbound'],
        df['x'] > df['rbound']]
output = ['middle', 'left', 'right']
df['area'] = np.select(cond, output, default=np.nan)
   lbound  rbound  x    area
0      -1       1  0  middle
1       5       7  1    left
2       0       1  2   right
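One caveat: because default=np.nan mixes a float into a column of strings, the result is an object-dtype column containing NaN for any row that matches none of the conditions. If you want a string everywhere, a string default is a straightforward alternative:

# rows matching no condition get a string label instead of NaN
df['area'] = np.select(cond, output, default='somewhere else')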