
pandas df.apply unexpectedly changes dataframe inplace

From my understanding, pandas.DataFrame.apply does not apply changes inplace and we should use its return object to persist any changes. However, I've found the following inconsistent behavior:

Let's apply a dummy function for the sake of ensuring that the original df remains untouched:

>>> def foo(row: pd.Series):
...     row['b'] = '42'

>>> df = pd.DataFrame([('a0','b0'),('a1','b1')], columns=['a', 'b'])
>>> df.apply(foo, axis=1)
>>> df
    a   b
0   a0  b0
1   a1  b1

This behaves as expected. However, foo will apply the changes inplace if we modify the way we initialize this df:

>>> df2 = pd.DataFrame(columns=['a', 'b'])
>>> df2['a'] = ['a0','a1']
>>> df2['b'] = ['b0','b1']
>>> df2.apply(foo, axis=1)
>>> df2
    a   b
0   a0  42
1   a1  42

I've also noticed that this does not happen if the column dtypes are not 'object'. Why does apply() behave differently in these two contexts?

Python: 3.6.5

Pandas: 0.23.1

Pedro Fialho asked Sep 22 '18 15:09


2 Answers

Interesting question! I believe the behavior you're seeing is an artifact of the way you use apply.

As you correctly point out, apply is not intended to be used to modify a dataframe. However, since apply takes an arbitrary function, it cannot guarantee that the function is free of side effects and leaves the dataframe unchanged. Here, you've found a great example of that behavior, because your function foo attempts to modify the row that it is passed by apply.

Modifying the row inside the function passed to apply can leak into the original dataframe as a side effect, as you observed, so it is best avoided.

Instead, consider this idiomatic approach for apply. The function apply is often used to create a new column. Here's an example of how apply is typically used, which I believe would steer you away from this potentially troublesome area:

import pandas as pd
# construct df2 column by column, as you did (with slightly different values)
df2 = pd.DataFrame(columns=['a', 'b'])
df2['a'] = ['a0','b0']
df2['b'] = ['a1','b1']

df2['b_copy'] = df2.apply(lambda row: row['b'], axis=1) # apply to each row
df2['b_replace'] = df2.apply(lambda row: '42', axis=1) 
df2['b_reverse'] = df2['b'].apply(lambda val: val[::-1]) # apply to each value in b column

print(df2)

# output:
#     a   b b_copy b_replace b_reverse
# 0  a0  a1     a1        42        1a
# 1  b0  b1     b1        42        1b

Notice that pandas passes a row or a cell to the function you give as the first argument to apply, and you then store the function's output in a column of your choice.

If you'd like to modify a dataframe row-by-row, take a look at iterrows and loc for the most idiomatic route.
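As a rough illustration of the iterrows and loc route, here is a minimal sketch. The dataframe and the '-42' suffix are made up for the example; the point is that the row yielded by iterrows is only read, and every write goes through df.loc so that it is explicit.

```python
import pandas as pd

# Example data, assumed for illustration only
df = pd.DataFrame({'a': ['a0', 'a1'], 'b': ['b0', 'b1']})

# iterrows yields (index label, row as a Series); treat the row as read-only
for idx, row in df.iterrows():
    # Write through df.loc so the change lands in df itself, explicitly
    df.loc[idx, 'b'] = row['a'] + '-42'

print(df)
```

For a change that does not depend on the row at all, a plain column assignment such as df['b'] = '42' is usually simpler than any loop.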

Maxim Zaslavsky answered Oct 05 '22 15:10


This may be late, but I think it can help anyone who reaches this question.

When we define foo like this:

def foo(row: pd.Series):
    row['b'] = '42'

and then use it in:

df.apply(foo, axis=1)

we don't expect any change in df, but a change occurs. Why?

Let's review what happens under the hood:

apply calls foo and passes it one row as a pandas.Series. Python passes arguments by object reference: the function receives the same object the caller holds, not a copy. This is true for every type, but it only matters for mutable objects such as a Series, because immutable ones (int, float, str, ...) cannot be changed in place. So any in-place change that foo makes to row changes the Series that apply built, and when that Series shares memory with the block where the DataFrame's data resides, the change shows up in df immediately. Whether the memory is shared depends on how pandas stores the data internally, which is why the result can differ between dataframes and between pandas versions.
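The effect of passing the same object can be seen without apply at all. Below is a minimal sketch on a standalone Series; mutate and mutate_copy are hypothetical names chosen for this example and are not part of pandas.

```python
import pandas as pd


def mutate(row: pd.Series) -> None:
    # Assigns into the very object the caller passed in; no copy is made
    row['b'] = '42'


def mutate_copy(row: pd.Series) -> pd.Series:
    # Works on an independent copy, so the caller's Series is left alone
    new_row = row.copy(deep=True)
    new_row['b'] = '42'
    return new_row


s = pd.Series({'a': 'a0', 'b': 'b0'})

result = mutate_copy(s)
b_after_copy = s['b']    # the original still holds its old value here

mutate(s)
b_after_mutate = s['b']  # the original itself has now been changed

print(b_after_copy, result['b'], b_after_mutate)
```

With apply, the extra question is only whether the Series handed to the function shares memory with the dataframe; the function-call semantics are the same as here.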

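We can rewrite foo (I'll call it bar) so that it changes nothing in place, by deep-copying the row, that is, making another row with the same values in a different block of memory. A lambda passed to apply usually behaves like bar, but only because a lambda returns a new value and cannot contain an assignment statement such as row['b'] = '42'; it does not copy the row itself.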

def bar(row: pd.Series):
    row_temp=row.copy(deep=True)
    row_temp['b'] = '42'
    return row_temp

Complete Code

import pandas as pd


# Changes df in place -- unlike a typical lambda
def foo(row: pd.Series):
    row['b'] = '42'


# Does not change df in place -- behaves like a typical lambda
def bar(row: pd.Series):
    row_temp = row.copy(deep=True)
    row_temp['b'] = '42'
    return row_temp


df2 = pd.DataFrame(columns=['a', 'b'])
df2['a'] = ['a0', 'a1']
df2['b'] = ['b0', 'b1']

print(df2)

# No change inplace
df_b = df2.apply(bar, axis=1)
print(df2)
# bar function works
print(df_b)

print(df2)
# Changes inplace
df2.apply(foo, axis=1)
print(df2)


Output

#df2 before any change
    a   b
0  a0  b0
1  a1  b1

#calling df2.apply(bar, axis=1) did not change df2 in place
    a   b
0  a0  b0
1  a1  b1

#df_b = df2.apply(bar, axis=1) #bar is working as expected
    a   b
0  a0  42
1  a1  42

#print df2 again to confirm it is unchanged
    a   b
0  a0  b0
1  a1  b1

#call df2.apply(foo, axis=1) -- foo changed df2 in place (compare with bar)
    a   b
0  a0  42
1  a1  42

Seyfi answered Oct 05 '22 14:10