
Efficient way to update column value for subset of rows on Pandas DataFrame?

Tags:

python

pandas

When using Pandas to update the value of a column for a specific subset of rows, what is the best way to do it?

Easy example:

import pandas as pd

df = pd.DataFrame({'name' : pd.Series(['Alex', 'John', 'Christopher', 'Dwayne']),
                   'value' : pd.Series([1., 2., 3., 4.])})

Objective: update the value column based on the length of each name and on the current contents of the value column itself.

The following line achieves the objective:

df.value[df.name.str.len() == 4 ] = df.value[df.name.str.len() == 4] * 1000

However, this line filters the whole data frame twice, once on the LHS and once on the RHS, which I assume is not the most efficient way. It also does not do the update 'in place'.
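A rough way to check that assumption is to time both variants. The sketch below is only illustrative: the enlarged frame, the helper names mask_twice and mask_once, and the repeat count are made up for the example, and it uses .loc on both sides rather than the chained form above. Timings will vary by machine and pandas version, so no numbers are claimed here.

```python
import timeit

import pandas as pd

# Hypothetical larger frame (100,000 rows) so any difference is measurable.
names = ['Alex', 'John', 'Christopher', 'Dwayne'] * 25_000
df = pd.DataFrame({'name': names, 'value': [1., 2., 3., 4.] * 25_000})


def mask_twice(frame):
    # Builds the boolean mask on both sides of the assignment.
    out = frame.copy()
    out.loc[out['name'].str.len() == 4, 'value'] = (
        out.loc[out['name'].str.len() == 4, 'value'] * 1000
    )
    return out


def mask_once(frame):
    # Builds the boolean mask a single time and reuses it.
    out = frame.copy()
    mask = out['name'].str.len() == 4
    out.loc[mask, 'value'] *= 1000
    return out


# Both variants should produce identical frames.
same_result = mask_twice(df).equals(mask_once(df))

t_twice = timeit.timeit(lambda: mask_twice(df), number=5)
t_once = timeit.timeit(lambda: mask_once(df), number=5)

print(f"same result: {same_result}")
print(f"mask built twice: {t_twice:.3f}s, mask built once: {t_once:.3f}s")
```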

Basically I'm looking for the pandas equivalent of R's data.table ':=' operator:

df[nchar(name) == 4, value := value*1000]

And for other kinds of operations, such as:

df[nchar(name) == 4, value := paste0("short_", as.character(value))]

Environment: Python 3.6, Pandas 0.22

Thanks in advance.

AlexSB asked Feb 13 '18 11:02



2 Answers

You need loc with *=:

df.loc[df.name.str.len() == 4, 'value'] *= 1000
print(df)
          name   value
0         Alex  1000.0
1         John  2000.0
2  Christopher     3.0
3       Dwayne     4.0

EDIT:

More general solutions:

mask = df.name.str.len() == 4
df.loc[mask, 'value'] = df.loc[mask, 'value'] * 1000

Or:

df.update(df.loc[mask, 'value'] * 1000)
jezrael answered Oct 20 '22 00:10
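To see that the three forms in the answer above give the same result, here is a minimal self-contained sketch. The make_df helper and the df1/df2/df3 names are introduced only for this illustration; the data is the toy frame from the question. Note that DataFrame.update matches on index and column name, which is why the Series passed to it keeps the name 'value'.

```python
import pandas as pd


def make_df():
    # Same toy data as in the question.
    return pd.DataFrame({'name': ['Alex', 'John', 'Christopher', 'Dwayne'],
                         'value': [1., 2., 3., 4.]})


# 1) In-place multiplication through .loc
df1 = make_df()
mask = df1['name'].str.len() == 4
df1.loc[mask, 'value'] *= 1000

# 2) Explicit assignment, reusing one precomputed mask on both sides
df2 = make_df()
mask = df2['name'].str.len() == 4
df2.loc[mask, 'value'] = df2.loc[mask, 'value'] * 1000

# 3) DataFrame.update, which aligns the Series named 'value' on the index
df3 = make_df()
mask = df3['name'].str.len() == 4
df3.update(df3.loc[mask, 'value'] * 1000)

print(df1['value'].tolist())  # [1000.0, 2000.0, 3.0, 4.0]
print(df2['value'].tolist() == df1['value'].tolist())
print(df3['value'].tolist() == df1['value'].tolist())
```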


This may be what you require:

df.loc[df.name.str.len() == 4, 'value'] *= 1000

df.loc[df.name.str.len() == 4, 'value'] = 'short_' + df['value'].astype(str)
jpp answered Oct 19 '22 23:10
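For the string case, a hedged variant that computes the mask once and builds the new strings only for the selected rows might look like the sketch below. Casting the column to object first is an assumption on my part, not something the answers state: the column then holds a mix of strings and floats, and recent pandas versions may warn or raise when strings are written into a float64 column directly.

```python
import pandas as pd

# Same toy data as in the question.
df = pd.DataFrame({'name': ['Alex', 'John', 'Christopher', 'Dwayne'],
                   'value': [1., 2., 3., 4.]})

# Build the boolean mask once and reuse it.
mask = df['name'].str.len() == 4

# Cast to object first so strings and floats can share the column.
df['value'] = df['value'].astype(object)

# Build the prefixed strings for the selected rows only.
df.loc[mask, 'value'] = 'short_' + df.loc[mask, 'value'].astype(str)

print(df['value'].tolist())  # ['short_1.0', 'short_2.0', 3.0, 4.0]
```

Run on a fresh frame, this prefixes the original values. If it is run after the multiplication step, the strings are built from the already multiplied values instead.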