I have a PySpark DataFrame with one column containing one-hot encoded vectors. I want to aggregate the vectors by element-wise addition after a groupBy.
e.g. df[userid, action]
Row1: ["1234", [1, 0, 0]]
Row2: ["1234", [0, 1, 0]]
I want the output as Row: ["1234", [1, 1, 0]], so the vector is the element-wise sum of all vectors grouped by userid.
How can I achieve this? PySpark's built-in sum aggregate does not support vector addition.
You have several options (sketches of each follow below):

1. Use `pyspark.ml.stat.Summarizer.sum`, which aggregates Vector columns natively on the JVM side.
2. Collect each group's vectors with `collect_list` and sum them in a Python UDF.
3. Drop down to the RDD API and reduce by key with element-wise addition.

Both options 2 & 3 would be relatively inefficient (costing both CPU and memory), since every vector has to be serialized out of the JVM into Python; prefer option 1 where your Spark version allows it.
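A minimal sketch of option 1, assuming the vectors are `pyspark.ml.linalg` vectors (the type produced by `OneHotEncoder`) and a Spark version that ships `Summarizer.sum` (3.0+):

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Summarizer

spark = SparkSession.builder.getOrCreate()

# Toy data mirroring the question's example.
df = spark.createDataFrame(
    [("1234", Vectors.dense([1, 0, 0])),
     ("1234", Vectors.dense([0, 1, 0]))],
    ["userid", "action"],
)

# Summarizer.sum is a native aggregate over Vector columns, so the
# addition happens on the JVM side without a round trip into Python.
result = df.groupBy("userid").agg(Summarizer.sum(df.action).alias("action_sum"))
result.show(truncate=False)  # 1234 -> [1.0, 1.0, 0.0]
```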
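Option 2 as a sketch: collect each group's vectors and add them in a UDF. The `sum_vectors` helper is hypothetical, not a built-in:

```python
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

# Hypothetical helper: sums a list of vectors in Python.
# Every vector is deserialized out of the JVM here, hence the cost.
@udf(returnType=VectorUDT())
def sum_vectors(vectors):
    return Vectors.dense(np.sum([v.toArray() for v in vectors], axis=0))

result = (
    df.groupBy("userid")
      .agg(F.collect_list("action").alias("actions"))
      .select("userid", sum_vectors("actions").alias("action_sum"))
)
```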
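And a sketch of option 3, dropping to the RDD API and reducing by key with element-wise NumPy addition:

```python
from pyspark.ml.linalg import Vectors

# Shuffle (userid, numpy array) pairs and add arrays element-wise.
summed = (
    df.rdd
      .map(lambda row: (row["userid"], row["action"].toArray()))
      .reduceByKey(lambda a, b: a + b)
      .map(lambda kv: (kv[0], Vectors.dense(kv[1])))
)
result = summed.toDF(["userid", "action_sum"])
```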