Replace values in a dataframe column based on condition

Question

I have a seemingly easy task. Dataframe with 2 columns: A and B. If values in B are larger than values in A - replace those values with values of A. I used to do this by doing df.B[df.B > df.A] = df.A, however recent upgrade of pandas started giving a SettingWithCopyWarning when encountering this chained assignment. Official documentation recommends using .loc.

Okay, I said, and did it through df.loc[df.B > df.A, 'B'] = df.A and it all works fine, unless column B has all values of NaN. Then something weird happens:

In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, np.NaN, np.NaN]})

In [2]: df
Out[2]: 
   A   B
0  1 NaN
1  2 NaN
2  3 NaN

In [3]: df.loc[df.B > df.A, 'B'] = df.A

In [4]: df
Out[4]: 
   A                    B
0  1 -9223372036854775808
1  2 -9223372036854775808
2  3 -9223372036854775808

Now, if even one of B's elements satisfies the condition (larger than A), then it all works fine:

In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, 4, np.NaN]})

In [2]: df
Out[2]: 
   A   B
0  1 NaN
1  2   4
2  3 NaN

In [3]: df.loc[df.B > df.A, 'B'] = df.A

In [4]: df
Out[4]: 
   A   B
0  1 NaN
1  2   2
2  3 NaN

But if none of Bs elements satisfy, then all NaNs get replaces with -9223372036854775808:

In [1]: df = pd.DataFrame({'A':[1,2,3],'B':[np.NaN,1,np.NaN]})

In [2]: df
Out[2]: 
   A   B
0  1 NaN
1  2   1
2  3 NaN

In [3]: df.loc[df.B > df.A, 'B'] = df.A

In [4]: df
Out[4]: 
   A                    B
0  1 -9223372036854775808
1  2                    1
2  3 -9223372036854775808

Is this a bug or a feature? How should I have done this replacement?

Thank you!

Jeff · Accepted Answer

This is a buggie, fixed here.

Since pandas allows basically anything to be set on the right-hand-side of an expression in loc, there are probably 10+ cases that need to be disambiguated. To give you an idea:

df.loc[lhs, column] = rhs

where rhs could be: list,array,scalar, and lhs could be: slice,tuple,scalar,array

and a small subset of cases where the resulting dtype of the column needs to be inferred / set according to the rhs. (This is a bit complicated). For example say you don't set all of the elements on the lhs and it was integer, then you need to coerce to float. But if you did set all of the elements AND the rhs was an integer then it needs to be coerced BACK to integer.

In this this particular case, the lhs is an array, so we would normally try to coerce the lhs to the type of the rhs, but this case degenerates if we have an unsafe conversion (int -> float)

Suffice to say this was a missing edge case.

Replace values in a dataframe column based on condition

Tags:

python

pandas

dataframe

ozhogin

1 Answers

Jeff

Recent Activity

Donate For Us

Replace values in a dataframe column based on condition

Tags:

python

pandas

dataframe

ozhogin

1 Answers

Jeff

Related questions

Recent Activity

Donate For Us