Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas add column with value based on condition based on other columns

Tags:

python

pandas

I have the following pandas dataframe:

enter image description here

import pandas as pd import numpy as np  d = {'age' : [21, 45, 45, 5],      'salary' : [20, 40, 10, 100]}  df = pd.DataFrame(d) 

and would like to add an extra column called "is_rich" which captures if a person is rich depending on his/her salary. I found multiple ways to accomplish this:

# method 1 df['is_rich_method1'] = np.where(df['salary']>=50, 'yes', 'no')  # method 2 df['is_rich_method2'] = ['yes' if x >= 50 else 'no' for x in df['salary']]  # method 3 df['is_rich_method3'] = 'no' df.loc[df['salary'] > 50,'is_rich_method3'] = 'yes' 

resulting in:

enter image description here

However I don't understand what the preferred way is. Are all methods equally good depending on your application?

like image 300
Rutger Hofste Avatar asked May 16 '18 16:05

Rutger Hofste


People also ask

How do I add a conditional column in pandas?

You can create a conditional column in pandas DataFrame by using np. where() , np. select() , DataFrame. map() , DataFrame.


1 Answers

Use the timeits, Luke!

enter image description here

Conclusion
List comprehensions perform the best on smaller amounts of data because they incur very little overhead, even though they are not vectorized. OTOH, on larger data, loc and numpy.where perform better - vectorisation wins the day.

Keep in mind that the applicability of a method depends on your data, the number of conditions, and the data type of your columns. My suggestion is to test various methods on your data before settling on an option.

One sure take away from here, however, is that list comprehensions are pretty competitive—they're implemented in C and are highly optimised for performance.


Benchmarking code, for reference. Here are the functions being timed:

def numpy_where(df):   return df.assign(is_rich=np.where(df['salary'] >= 50, 'yes', 'no'))  def list_comp(df):   return df.assign(is_rich=['yes' if x >= 50 else 'no' for x in df['salary']])  def loc(df):   df = df.assign(is_rich='no')   df.loc[df['salary'] > 50, 'is_rich'] = 'yes'   return df 
like image 195
cs95 Avatar answered Sep 18 '22 23:09

cs95