
Efficiently setting values on a subset of rows

Tags:

pandas

I am wondering about the best way to change values in a subset of rows in a DataFrame. Let's say I want to double the values in the column 'value' for the rows where 'selected' is True.

In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'value': [1, 2, 3, 4], 'selected': [False, False, True, True]})
In [3]: df
Out[3]: 
  selected  value
0    False      1
1    False      2
2     True      3
3     True      4

There are several ways to do this:

# 1. Subsetting with .loc on left and right hand side:
df.loc[df['selected'], 'value'] = df.loc[df['selected'], 'value'] * 2

# 2. Subsetting with .loc on left hand side:
df.loc[df['selected'], 'value'] = df['value'] * 2

# 3. Using where()
df['value'] = (df['value'] * 2).where(df['selected'], df['value'])

If I only subset on the left hand side (option 2), would Pandas actually make the calculation for all rows and then discard the result for all but the selected rows?

In terms of evaluation, is there any difference between using loc and where?

malte asked Oct 30 '22 07:10



1 Answer

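Your option #2 is the standard, recommended way to do this. Option #1 works too, but the extra subsetting on the right-hand side is unnecessary: .loc aligns the right-hand Series on the index and assigns only to the rows picked by the boolean mask. So yes, df['value'] * 2 is evaluated for every row, and the results for the unselected rows are simply not used. (Older answers also mention .ix here; it was removed in pandas 1.0.)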

# 2. Subsetting with .loc on left hand side:
df.loc[df['selected'], 'value'] = df['value'] * 2
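
As a minimal sketch of that index alignment, here is option #2 split into two steps on the question's data. The intermediate name doubled is introduced only for illustration and is not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4],
                   'selected': [False, False, True, True]})

# The right-hand side is a full-length Series: all four rows are doubled.
doubled = df['value'] * 2

# .loc matches `doubled` to the selected rows by index label,
# so only rows 2 and 3 receive new values.
df.loc[df['selected'], 'value'] = doubled

result = df['value'].tolist()
print(result)
```

Rows 0 and 1 keep their original values even though doubled contains entries for them.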

If you don't use .loc/.iloc on the left-hand side, you can run into problems such as chained indexing, where the assignment may land on a temporary copy instead of the original DataFrame. Hence, using .loc/.iloc is generally the safest and most recommended way to go. There is nothing wrong with your option #3, but it is the least readable of the three.

One alternative that is often faster and perfectly acceptable is NumPy's where() function:

import numpy as np
df['value'] = np.where(df['selected'], df['value'] * 2, df['value'])

The first argument is the selection (mask), the second is the value to assign where the mask is True, and the third is the value to assign where it is False. It's especially useful if you also want to create or change the value where the selection is False.
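
A short sketch of that last point, using the question's data: the selected rows are doubled and the others are negated. The negation is just an illustrative choice for the False branch, not something from the question. Note that np.where returns a plain NumPy array, so the assignment is positional rather than index-aligned.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4],
                   'selected': [False, False, True, True]})

# Double the selected rows; negate the rest (illustrative False branch).
df['value'] = np.where(df['selected'], df['value'] * 2, -df['value'])

result = df['value'].tolist()
print(result)
```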

JohnE answered Nov 08 '22 15:11
