Passing pandas DataFrame by reference

My question concerns whether a pandas DataFrame can be modified when it is passed to a function by reference. Consider the following code:

import sys

import pandas as pd

def foo(df1, df2):

    df1['B'] = 1
    df1 = df1.join(df2['C'], how='inner')

    return

def main(argv=None):

    # Create DataFrames.
    df1 = pd.DataFrame(range(0, 10, 2), columns=['A'])
    df2 = pd.DataFrame(range(1, 11, 2), columns=['C'])

    foo(df1, df2)    # Pass df1 and df2 by reference.

    print(df1)

    return 0

if __name__ == '__main__':
    status = main()
    sys.exit(status)

The output is

   A  B  
0  0  1
1  2  1
2  4  1
3  6  1
4  8  1

and not

   A  B  C
0  0  1  1
1  2  1  3
2  4  1  5
3  6  1  7
4  8  1  9

In fact, if foo is defined as

def foo(df1, df2):

    df1 = df1.join(df2['C'], how='inner')
    df1['B'] = 1

    return

(i.e. the "join" statement before the other statement) then the output is simply

   A    
0  0 
1  2 
2  4 
3  6 
4  8

I'm intrigued as to why this is the case. Any insights would be appreciated.

asked Mar 01 '26 08:03 by labrynth


1 Answer

The issue is caused by this line:

df1 = df1.join(df2['C'], how='inner')

df1.join(df2['C'], how='inner') returns a new DataFrame. After this line, the local name df1 no longer refers to the DataFrame that was passed in; it has been rebound to the new result. The original DataFrame continues to exist, and the join does not change it. This isn't really a pandas issue; it is how assignment works in Python, and in most other languages.
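To make the difference between mutating an object and rebinding a name concrete, here is a minimal sketch. The function name mutate_then_rebind and the small frames are made up for illustration and are not from the question; the function returns its local object only so the two can be compared.

```python
import pandas as pd


def mutate_then_rebind(df):
    # Column assignment mutates the object that the caller also holds.
    df['B'] = 1
    # join returns a new DataFrame; this assignment only rebinds the local name.
    df = df.join(pd.DataFrame({'C': [1, 3, 5]}), how='inner')
    # Hand back the local object so the caller can compare it with the original.
    return df


original = pd.DataFrame({'A': [0, 2, 4]})
local_result = mutate_then_rebind(original)

print(list(original.columns))      # columns of the caller's object
print(list(local_result.columns))  # columns of the object created by join
print(local_result is original)    # whether both names refer to one object
```

The caller's frame picks up B because that change happened on the shared object, but it never sees C, which lives only on the new object that join created.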

Some pandas methods have an inplace argument, which would do what you want, but join does not. If you need the joined result outside the function, return the new DataFrame and reassign it in the caller.
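A sketch of the return-and-reassign approach, reusing the names from the question. This is one possible rewrite of foo, not the only way to structure it.

```python
import pandas as pd


def foo(df1, df2):
    # Build the joined frame and return it instead of relying on side effects.
    df1 = df1.join(df2['C'], how='inner')
    df1['B'] = 1
    return df1


df1 = pd.DataFrame({'A': list(range(0, 10, 2))})
df2 = pd.DataFrame({'C': list(range(1, 11, 2))})

# Rebind the caller's name to the frame that foo returns.
df1 = foo(df1, df2)
print(df1)
```

With this version the order of the two statements inside foo no longer matters for the caller, since everything the caller sees comes from the returned frame.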

answered Mar 03 '26 03:03 by Jezzamon