My question is regarding the immutability of a pandas DataFrame when it is passed by reference. Consider the following code:
    import sys

    import pandas as pd

    def foo(df1, df2):
        df1['B'] = 1
        df1 = df1.join(df2['C'], how='inner')
        return()

    def main(argv = None):
        # Create DataFrames.
        df1 = pd.DataFrame(range(0,10,2), columns=['A'])
        df2 = pd.DataFrame(range(1,11,2), columns=['C'])
        foo(df1, df2) # Pass df1 and df2 by reference.
        print(df1)
        return(0)

    if __name__ == '__main__':
        status = main()
        sys.exit(status)
The output is
       A  B
    0  0  1
    1  2  1
    2  4  1
    3  6  1
    4  8  1
and not
       A  B  C
    0  0  1  1
    1  2  1  3
    2  4  1  5
    3  6  1  7
    4  8  1  9
In fact, if foo is defined as
    def foo(df1, df2):
        df1 = df1.join(df2['C'], how='inner')
        df1['B'] = 1
        return()
(i.e. the "join" statement before the other statement) then the output is simply
       A
    0  0
    1  2
    2  4
    3  6
    4  8
I'm intrigued as to why this is the case. Any insights would be appreciated.
The issue is caused by this line:
    df1 = df1.join(df2['C'], how='inner')
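df1.join(df2['C'], how='inner') returns a new DataFrame. After this line, the local name df1 no longer refers to the DataFrame that was passed in, but to the new one, because it has been rebound to the result. The original DataFrame continues to exist and is not changed by the join. By contrast, df1['B'] = 1 modifies the object in place, which is why column B shows up in the caller only when that statement runs before the join. This isn't really a pandas issue; it is how name binding works in Python and in most other languages.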
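To make the difference between in-place modification and rebinding concrete, here is a minimal sketch. The helper inspect_rebinding is a hypothetical name made up for this illustration; it only checks, with the is operator, whether the local name still points at the caller's object after each statement.

```python
import pandas as pd

def inspect_rebinding(df1, df2):
    # Hypothetical helper, for illustration only.
    caller_object = df1                        # second name for the caller's DataFrame
    df1['B'] = 1                               # in-place: the caller's object gains column B
    same_after_setitem = df1 is caller_object
    df1 = df1.join(df2['C'], how='inner')      # rebinding: df1 now names a new object
    same_after_join = df1 is caller_object
    return same_after_setitem, same_after_join

df1 = pd.DataFrame(range(0, 10, 2), columns=['A'])
df2 = pd.DataFrame(range(1, 11, 2), columns=['C'])
flags = inspect_rebinding(df1, df2)
print(flags)               # (True, False)
print(list(df1.columns))   # ['A', 'B']
```

The caller's DataFrame ends up with A and B but not C, matching the first output in the question.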
Some pandas methods have an inplace argument, which would do what you want, but join does not. To get the joined result, return the new DataFrame from the function and reassign it in the caller.
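One way to apply this, sketched under the assumption that you are free to change foo's return value and the call site:

```python
import pandas as pd

def foo(df1, df2):
    df1['B'] = 1                               # still modifies the caller's DataFrame in place
    return df1.join(df2['C'], how='inner')     # hand the new, joined DataFrame back

df1 = pd.DataFrame(range(0, 10, 2), columns=['A'])
df2 = pd.DataFrame(range(1, 11, 2), columns=['C'])
df1 = foo(df1, df2)                            # rebind the caller's name to the result
print(df1)
```

After the reassignment, df1 in the caller has the columns A, B and C.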