Is it possible to directly compute the product (or for example sum) of two columns without using
grouped.apply(lambda x: (x.a*x.b).sum())
It is much faster (less than half the time on my machine) to use a helper column:
df['helper'] = df.a*df.b
grouped = df.groupby(something)
grouped['helper'].sum()
df = df.drop('helper', axis=1)
But I don't really like having to do this. Such a product is useful, for example, to compute the weighted average per group. Here the lambda approach would be
grouped.apply(lambda x: (x.a*x.b).sum()/(x.b).sum())
and again this is much slower than dividing the grouped sum of the helper by the grouped b.sum().
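For reference, here is a minimal sketch of the helper-column version of the weighted average that does not modify df. The data and the column name key are made up for illustration (key stands in for whatever something is), and assign is used only so the helper column lives on a temporary copy:

```python
import pandas as pd

# Hypothetical example data; 'key' stands in for the real grouping column.
df = pd.DataFrame({
    "key": ["x", "x", "y", "y"],
    "a": [1.0, 3.0, 2.0, 4.0],
    "b": [1.0, 1.0, 1.0, 3.0],
})

# assign() returns a copy that carries the helper column, so df is untouched
# and there is nothing to drop afterwards.
grouped = df.assign(helper=df.a * df.b).groupby("key")

# Per-group weighted average of a, using b as the weights.
weighted_avg = grouped["helper"].sum() / grouped["b"].sum()
print(weighted_avg)
```

Both sums go through the built-in groupby sum, so no Python-level lambda is called per group.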
I want to eventually build an embedded array expression evaluator (Numexpr on steroids) to do things like this. Right now we're working with the limitations of Python: if you implemented a Cython aggregator to do (x * y).sum(), then it could be connected with groupby, but ideally you could write the Python expression as a function:
def weight_sum(x, y):
    return (x * y).sum()
and that would get "JIT-compiled" and be about as fast as groupby(...).sum(). What I'm describing is a pretty significant (many-month) project. If there were a BSD-compatible APL implementation I might be able to do something like the above quite a bit sooner (just thinking out loud).
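Until something like that exists, a function such as weight_sum can only be plugged into groupby through apply, which is the slow path from the question. A small sketch, with made-up data and a hypothetical grouping column key:

```python
import pandas as pd

def weight_sum(x, y):
    # Plain Python/pandas; nothing here is compiled.
    return (x * y).sum()

# Hypothetical example data; 'key' stands in for the real grouping column.
df = pd.DataFrame({
    "key": ["x", "x", "y", "y"],
    "a": [1.0, 3.0, 2.0, 4.0],
    "b": [1.0, 1.0, 1.0, 3.0],
})

# apply() calls the lambda once per group at Python speed.
result = df.groupby("key")[["a", "b"]].apply(lambda g: weight_sum(g.a, g.b))
print(result)
```

This gives the same numbers as summing a helper column per group; the difference is only in how the work is dispatched.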