Update a dataframe within apply after using groupby

I have a pandas dataframe that I want to group on and then update the original dataframe using iterrows and set_value. This doesn't appear to work.

Here is an example.

In [1]: def func(df, n):
   ...:     for i, row in df.iterrows():
   ...:         print("Updating {0} with value {1}".format(i, n))
   ...:         df.set_value(i, 'B', n)

In [2]: df = pd.DataFrame({"A": [1, 2], "B": [0, 0]})

In [3]: df
Out[4]:
   A  B
0  1  0
1  2  0

In [125]: func(df, 1)
Updating 0 with value 1
Updating 1 with value 1

In [126]: df
Out[126]:
   A  B
0  1  1
1  2  1

In [127]: df.groupby('A').apply(lambda df: func(df, 2))
Updating 0 with value 2
Updating 0 with value 2
Updating 1 with value 2

In [128]: df
Out[128]:
   A  B
0  1  1
1  2  1

I was hoping that B would have been updated to 2.

Why isn't this working, and what is the best way to achieve this result?

Kris Harper asked Jun 11 '26 18:06


1 Answer

The way you have things written, you seem to want func(df, n) to modify df in place. But df.groupby('A') (in some sense) creates another set of dataframes, one for each group, so passing func() to df.groupby('A').apply() only modifies these newly created dataframes and not the original df. Furthermore, the dataframe returned by apply() is a concatenation of the outputs of func() called on each group. Since func() has no return statement, every output is None, which is why the returned dataframe is empty.
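
To make the first point concrete, here is a minimal sketch, assuming a reasonably recent pandas. It iterates over the groups directly instead of calling apply(), and the starting values mirror the state of df in the question. Copy-versus-view internals differ between pandas versions, so read it as an illustration rather than a guarantee.

```python
import pandas as pd

# Same shape as the question's dataframe after the first call to func
df = pd.DataFrame({"A": [1, 2], "B": [1, 1]})

# groupby hands out each group as its own dataframe object
for key, group in df.groupby("A"):
    group.loc[:, "B"] = 2           # changes this group's dataframe only
    print(key, group["B"].tolist())

# The original dataframe still holds the old values
print(df["B"].tolist())
```

The writes land in the per-group objects, which are discarded once the loop (or apply) finishes, so df itself never sees them.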

The shortest fix to your problem is to return df at the end of func:

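import pandas as pd

def func(df, n):
    for i, row in df.iterrows():
        print("Updating {0} with value {1}".format(i, n))
        # set_value was removed in pandas 1.0; .at is the scalar setter
        df.at[i, 'B'] = n
    return df

# Rebind df: apply() builds a new dataframe from the returned groups
df = df.groupby('A').apply(lambda df: func(df, 2))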

I presume this is not exactly what you had in mind, because you probably expected func to modify the original dataframe in place. If that is your intention, you would need a for loop over the groups combined with .loc on the original df. Bear in mind that assigning through .loc becomes computationally expensive if you call it many times.
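
A rough sketch of the loop-plus-.loc approach, under the assumption that the value to write can be computed per group. The helper value_for_group is hypothetical and simply returns 2 to match the question. Using the .groups mapping (group key to row labels) is one way to do this, not the only one.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [1, 1]})

def value_for_group(key):
    """Hypothetical per-group rule; here every group gets the value 2."""
    return 2

# .groups maps each group key to the index labels of its rows
for key, row_labels in df.groupby("A").groups.items():
    # Write through .loc on the ORIGINAL dataframe, not on a group copy
    df.loc[row_labels, "B"] = value_for_group(key)

print(df)
```

This is one .loc call per group rather than one per row, which keeps the cost manageable when groups are large and few.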

I would also guess that your function to set values depends on a more complicated criterion, but usually you can vectorize things and avoid having to use .iterrows() altogether.

To avoid the XY problem, I'd suggest describing your function in more detail, because chances are that you can get everything done with a few lines incorporating the use of .loc and avoiding the need to iterate through every row in Python. Case in point: df['B'] = 2 (sans a print statement) is a one-liner solution to your problem.
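
As a hedged illustration of what vectorizing might look like, the sketch below shows the one-liner, a conditional update with a boolean mask, and a per-group computation written back with transform. The extra row, the A == 2 criterion, the value 5, and the B_group_sum column are all made up for the example and are not part of the question.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2], "B": [0, 0, 0]})

# Unconditional update: the one-liner from the answer
df["B"] = 2

# Conditional update: a boolean mask with .loc replaces the row loop
df.loc[df["A"] == 2, "B"] = 5   # hypothetical criterion and value

# Per-group computation aligned back to the original rows
df["B_group_sum"] = df.groupby("A")["B"].transform("sum")

print(df)
```

transform returns a result with the same index as the original dataframe, so it can be assigned directly without any in-place mutation inside apply().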

Ken Wei answered Jun 13 '26 07:06