I have a df that needs to be grouped, filtered, and modified, with a custom function applied. My 'normal' approach is too slow and not the most elegant one!
[name] [cnt] [num] [place] [y]
AAAA 12 20182.0 5.0 1.75
BBBB 12 20182.0 7.0 2.00
AAAA 10 20381.0 10.0 9.25
BBBB 10 20381.0 12.0 18.75
EEEE 12 21335.0 1.0 0.00
RRRR 12 21335.0 8.0 3.00
CCCC 12 21335.0 9.0 3.50
I need to group the df on [num] i.e.:
[name] [cnt] [num] [place] [y]
AAAA 12 20182.0 5.0 1.75
BBBB 12 20182.0 7.0 2.00
For each of those groups I need to do three tasks:
I. Within each group, drop every row whose [y] value occurs more than once in that group. Groups can consist of up to 6 rows.
II. Create all ordered pairs (permutations of length two) of the [place] values: (5,7) and (7,5)
III. Apply a custom function to every pair:
def func(p1, p2):
    diff_p = p2 - p1
    if diff_p > 0:
        return 2 / (diff_p * p2)
    else:
        return p1 / (diff_p * 12)
Here p1 is the first place of the tuple, p2 is the second place of the tuple, and 12 is the value from the [cnt] column. For the example group this gives:
[name] [cnt] [num] [place] [y] [desired]
AAAA 12 20182.0 5.0 1.75 0.1428571429
BBBB 12 20182.0 7.0 2.00 -0.2916666667
AAAA's [desired] column holds the mean custom-function result over all tuples in which AAAA's place value comes first. In this example that is only one tuple.
(But as mentioned, the groups can consist of up to 6 values, which creates multiple tuples in which AAAA's place is the first value.)
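To make the "multiple tuples" case concrete, here is a small sketch using the three [place] values of the num == 21335.0 group from the table above. The names places, pairs and by_first are only illustrative, and it assumes the place values inside a group are distinct.

```python
import itertools

# place values of the num == 21335.0 group (EEEE, RRRR, CCCC)
places = [1.0, 8.0, 9.0]

# all ordered pairs of length two: 3 * 2 = 6 tuples
pairs = list(itertools.permutations(places, 2))

# collect the tuples by their first element; each row's [desired] value
# would be the mean of func over its own list of tuples
by_first = {}
for p1, p2 in pairs:
    by_first.setdefault(p1, []).append((p1, p2))

print(by_first[1.0])  # the tuples where EEEE's place comes first
```

With n rows in a group, every row gets n - 1 tuples, so a group of 6 produces 30 tuples in total.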
My current approach is:
df.groupby('num').apply(...)
where the applied function does the following, in order:
.drop_duplicates('y',keep=False)
list(itertools.permutations(df_grp.place.values, 2))
apply the custom function
.mean()
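Spelled out as runnable code, the steps above look roughly like this. It is a minimal sketch, not my exact code: add_desired is a made-up name, func takes cnt as a third argument instead of the hard-coded 12, and it loops over the groups explicitly instead of using .apply so the output shape stays predictable. It also assumes the df has a unique index.

```python
import pandas as pd

def func(p1, p2, cnt):
    # same formula as above, with cnt passed in instead of the literal 12
    diff_p = p2 - p1
    if diff_p > 0:
        return 2 / (diff_p * p2)
    return p1 / (diff_p * cnt)

def add_desired(df):
    # hypothetical helper: returns a copy of df with a [desired] column
    out = pd.Series(float("nan"), index=df.index)
    for _, grp in df.groupby("num"):
        # step I: drop every row whose y occurs more than once in the group
        grp = grp.drop_duplicates("y", keep=False)
        for i in grp.index:
            # steps II + III: all ordered pairs with row i's place first
            vals = [func(grp.at[i, "place"], grp.at[j, "place"], grp.at[i, "cnt"])
                    for j in grp.index if j != i]
            if vals:
                out.at[i] = sum(vals) / len(vals)
    return df.assign(desired=out)

df = pd.DataFrame({
    "name": ["AAAA", "BBBB", "AAAA", "BBBB", "EEEE", "RRRR", "CCCC"],
    "cnt": [12, 12, 10, 10, 12, 12, 12],
    "num": [20182.0, 20182.0, 20381.0, 20381.0, 21335.0, 21335.0, 21335.0],
    "place": [5.0, 7.0, 10.0, 12.0, 1.0, 8.0, 9.0],
    "y": [1.75, 2.00, 9.25, 18.75, 0.00, 3.00, 3.50],
})
result = add_desired(df)
print(result)
```

Rows that are dropped in step I, or that end up alone in their group, keep NaN in [desired].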
It becomes really slow after a while, since the first df is itself the output of another .groupby().apply() call.
Try GroupBy.aggregate(func, *args, **kwargs) to combine your three tasks.
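Whichever GroupBy method ends up calling it, the per-group work can be folded into a single function. The sketch below is one possible way to do that and is an assumption on my part, not something GroupBy.aggregate provides: it swaps itertools.permutations for NumPy broadcasting so that all ordered pairs of a group are evaluated at once. pairwise_mean is a hypothetical helper that expects the [place] values of one already-filtered group plus that group's [cnt] value.

```python
import numpy as np

def pairwise_mean(place, cnt):
    # hypothetical helper: mean custom-function result per row of one group
    place = np.asarray(place, dtype=float)
    if place.size < 2:
        # no pairs can be formed, so there is nothing to average
        return np.full(place.shape, np.nan)
    p1 = place[:, None]   # first element of each pair, one per matrix row
    p2 = place[None, :]   # second element of each pair, one per matrix column
    diff = p2 - p1
    # np.where evaluates both branches everywhere, so silence the
    # divide-by-zero warnings raised on the diagonal (diff == 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        vals = np.where(diff > 0, 2 / (diff * p2), p1 / (diff * cnt))
    # a row is never paired with itself
    np.fill_diagonal(vals, np.nan)
    return np.nanmean(vals, axis=1)

print(pairwise_mean([5.0, 7.0], 12))
```

For the example group this reproduces the [desired] values from the question (0.1428571429 and -0.2916666667). Since the matrix is at most 6 x 6 per group, the gain comes from avoiding the Python-level loop over tuples, not from the matrix size.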