
How to properly use pandas groupby with apply function for side effects? (First group applied twice)

I am using pandas to group a DataFrame by certain columns and apply a custom function to each group. The applied function relies on side effects: it acts on global data objects from within the function.

A documented caveat of pandas groupby with apply is that, by design, the function is called twice on the first group so that pandas can decide whether to take a fast or a slow code path. This is documented here: http://pandas.pydata.org/pandas-docs/stable/groupby.html#flexible-apply

Demonstrated here:

In [143]: import pandas as pd

In [144]: d = pd.DataFrame({"a":["x", "y"], "b":[1,2]})

In [145]: def identity(df):
   .....:     print(df)
   .....:     return df
   .....: 

In [146]: d.groupby("a").apply(identity)
   a  b
0  x  1
   a  b
0  x  1
   a  b
1  y  2
Out[146]: 
   a  b
0  x  1
1  y  2

This is mentioned in a few other Stack Overflow posts:

Python pandas groupby object apply method duplicates first group

Is Pandas 0.16.1 groupby().apply() method applying function more than once to the same group?

Mentioned on GitHub here:

https://github.com/pandas-dev/pandas/issues/7739

https://github.com/pandas-dev/pandas/issues/19167

This means that my side effect is called twice on the first group and results in unwanted changes.

My question is: how do I use pandas groupby and apply so that the side effects are not applied twice to the first group (or to any group, for that matter), with a guarantee that the function is called exactly once per group?

I was thinking of creating a dummy/fake group at the top of the DataFrame, but I wanted to put the question to the Stack Overflow community in case there is a better solution, and for the benefit of others.

Thank you for your help.

EDIT:

As requested in the comments, a few more details on the custom function and side effects.

The custom function reads from a global dictionary at the beginning of each call and writes to it at the end. It retrieves data by key and applies those values to the rows. At the end of the function, the updated values are written back to the global dictionary so that they are reflected in the next iteration.

The main reason for using groupby with apply is that it is the fastest way I have found to iterate over a DataFrame groupby object. I have also looked at plain iteration and list comprehensions.
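For reference, plain iteration (the slower option mentioned above) does call the function exactly once per group on any pandas version, because iterating a GroupBy object yields each group a single time. The following is only a minimal sketch of that pattern: the `state` dictionary, the `update_group` function and the `b_adjusted` column are hypothetical stand-ins for the real global dictionary and custom function, which are not shown in this question.

```python
import pandas as pd

# Hypothetical global state, standing in for the dictionary described above.
state = {"x": 0, "y": 0}


def update_group(key, group):
    """Read the stored value for `key`, apply it to the rows, write it back."""
    start = state[key]
    out = group.copy()
    out["b_adjusted"] = out["b"] + start
    # Side effect: runs once per call, so once per group in the loop below.
    state[key] = int(out["b_adjusted"].max())
    return out


d = pd.DataFrame({"a": ["x", "y", "x"], "b": [1, 2, 3]})

# Iterating the GroupBy object yields each (key, sub-DataFrame) pair once.
pieces = [update_group(key, group) for key, group in d.groupby("a")]

# Reassemble the per-group results in the original row order.
result = pd.concat(pieces).sort_index()
print(result)
print(state)
```

The trade-off is the one already noted: this avoids the duplicate call but gives up whatever speed apply provides.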

asked Nov 07 '22 01:11 by ZeroStack


1 Answer

A follow-up on this question: as of pandas version 0.25.0, released July 18, 2019, GroupBy.apply on a DataFrame evaluates the first group only once. Upgrading to this version or later is probably the most straightforward way to resolve the problem.

Release information here: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html
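To confirm how the installed version behaves, one option is to count how many times the applied function runs. This is only a rough sketch: the `calls` list and the `record` function are illustrative names, not part of pandas.

```python
import pandas as pd

calls = []  # one entry per invocation of the applied function


def record(group):
    # Side effect: remember which rows this call received.
    calls.append(tuple(group["b"].tolist()))
    return len(group)


d = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

# Select the non-key column explicitly so the function sees the same data
# whether or not the installed pandas passes grouping columns to it.
d.groupby("a")[["b"]].apply(record)

n_groups = d["a"].nunique()
print(pd.__version__, "calls:", len(calls), "groups:", n_groups)
```

On 0.25.0 or later the two counts should match. On an older release, one extra call for the first group would be expected.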

answered Nov 14 '22 23:11 by ZeroStack