Pandas add column with value based on condition based on other columns

Tags:

pandas

I have the following pandas dataframe:

    import pandas as pd
    import numpy as np

    d = {'age': [21, 45, 45, 5],
         'salary': [20, 40, 10, 100]}

    df = pd.DataFrame(d)

and would like to add an extra column called "is_rich" which captures if a person is rich depending on his/her salary. I found multiple ways to accomplish this:

    # method 1
    df['is_rich_method1'] = np.where(df['salary'] >= 50, 'yes', 'no')

    # method 2
    df['is_rich_method2'] = ['yes' if x >= 50 else 'no' for x in df['salary']]

    # method 3 (>= 50, so the threshold matches methods 1 and 2)
    df['is_rich_method3'] = 'no'
    df.loc[df['salary'] >= 50, 'is_rich_method3'] = 'yes'

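For this data, all three methods produce the same column: 'no' for the first three rows (salaries 20, 40 and 10) and 'yes' for the last row (salary 100).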

However, I don't understand which way is preferred. Are all methods equally good, or does the best choice depend on the application?

300

asked May 16 '18 16:05

1 Answer

Use the timeits, Luke!

Conclusion
List comprehensions perform the best on smaller amounts of data because they incur very little overhead, even though they are not vectorized. OTOH, on larger data, loc and numpy.where perform better - vectorisation wins the day.
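A rough way to reproduce this kind of comparison is sketched below. It is not the original benchmark: the helper name `time_methods`, the two frame sizes, and the random salary data are assumptions made for illustration, and the absolute numbers will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd


def time_methods(n, repeats=5):
    """Return the best-of-`repeats` time in seconds per method for n rows.

    Hypothetical helper: the data is random and only meant for illustration.
    """
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"salary": rng.integers(0, 100, size=n)})

    def numpy_where():
        return np.where(df["salary"] >= 50, "yes", "no")

    def list_comp():
        return ["yes" if x >= 50 else "no" for x in df["salary"]]

    def loc():
        out = df.assign(is_rich="no")
        out.loc[out["salary"] >= 50, "is_rich"] = "yes"
        return out

    candidates = {"numpy_where": numpy_where, "list_comp": list_comp, "loc": loc}
    return {
        name: min(timeit.repeat(fn, number=3, repeat=repeats))
        for name, fn in candidates.items()
    }


# Compare a small frame against a larger one (sizes chosen arbitrarily).
timings = {n: time_methods(n) for n in (100, 100_000)}
for n, result in timings.items():
    print(n, result)
```

Taking the minimum over several repeats is a common way to reduce noise from other processes on the machine.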

Keep in mind that the applicability of a method depends on your data, the number of conditions, and the data type of your columns. My suggestion is to test various methods on your data before settling on an option.
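As an illustration of how the number of conditions can change the picture, here is a minimal sketch using `numpy.select`, which is not one of the methods timed above. The column name `salary_band` and the 30/50 thresholds are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [21, 45, 45, 5], "salary": [20, 40, 10, 100]})

# Conditions are checked in order; the first one that is True wins.
conditions = [df["salary"] >= 50, df["salary"] >= 30]
choices = ["high", "middle"]

# Rows matching no condition fall back to the default label.
df["salary_band"] = np.select(conditions, choices, default="low")
print(df)
```

With more than two outcomes, a single `np.where` call would have to be nested, so a form like this may be easier to read; whether it is also faster on your data is something to measure rather than assume.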

One sure takeaway from here, however, is that list comprehensions are pretty competitive: they're implemented in C and are highly optimised for performance.

Benchmarking code, for reference. Here are the functions being timed:

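    import numpy as np

    # Vectorised: build the whole column in one NumPy call.
    def numpy_where(df):
        return df.assign(is_rich=np.where(df['salary'] >= 50, 'yes', 'no'))

    # Plain Python loop over the column values.
    def list_comp(df):
        return df.assign(is_rich=['yes' if x >= 50 else 'no' for x in df['salary']])

    # Fill a default, then overwrite the rows selected by the boolean mask.
    # Uses >= 50 so that all three functions apply the same threshold.
    def loc(df):
        df = df.assign(is_rich='no')
        df.loc[df['salary'] >= 50, 'is_rich'] = 'yes'
        return df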
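Before comparing timings, it may be worth checking that the candidates actually agree. The sketch below repeats the three functions so that it runs on its own, assumes the same `>= 50` threshold in all of them, and uses a made-up frame that includes the boundary value 50, where `>` and `>=` would differ.

```python
import numpy as np
import pandas as pd


def numpy_where(df):
    return df.assign(is_rich=np.where(df["salary"] >= 50, "yes", "no"))


def list_comp(df):
    return df.assign(is_rich=["yes" if x >= 50 else "no" for x in df["salary"]])


def loc(df):
    df = df.assign(is_rich="no")
    df.loc[df["salary"] >= 50, "is_rich"] = "yes"
    return df


# Hypothetical test frame; the last row sits exactly on the threshold.
df = pd.DataFrame({"salary": [20, 40, 10, 100, 50]})

results = {
    fn.__name__: fn(df)["is_rich"].tolist()
    for fn in (numpy_where, list_comp, loc)
}
all_agree = len({tuple(labels) for labels in results.values()}) == 1
print(results)
print("all methods agree:", all_agree)
```

Each function works on a copy returned by `assign`, so the input frame is left without an `is_rich` column.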

195

answered Sep 18 '22 23:09

cs95

Asked by Rutger Hofste. Tags: python, pandas.
