 

Custom aggregation on PySpark dataframes [duplicate]

I have a PySpark DataFrame with a column of one-hot encoded vectors. After a groupBy, I want to aggregate the vectors by element-wise addition.

e.g. df[userid, action]
Row1: ["1234", [1, 0, 0]]
Row2: ["1234", [0, 1, 0]]

I want the output as Row: ["1234", [1, 1, 0]], i.e. the vector is the element-wise sum of all vectors grouped by userid.

How can I achieve this? PySpark's built-in sum aggregate does not support vector addition.
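For reference, a minimal reproduction of the data described above, assuming the one-hot vectors are stored as plain array columns (the column names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two rows for the same user, each holding a one-hot encoded action vector
df = spark.createDataFrame(
    [("1234", [1, 0, 0]), ("1234", [0, 1, 0])],
    ["userid", "action"],
)

# Desired result after grouping by userid:
# +------+---------+
# |userid|   action|
# +------+---------+
# |  1234|[1, 1, 0]|
# +------+---------+
```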

asked Dec 07 '16 by user2242666


1 Answer

You have several options:

  1. Create a user defined aggregate function (UDAF). The problem is that you would need to write the UDAF in Scala and wrap it for use from Python.
  2. Use the collect_list function to collect all vectors into a list per group, then apply a UDF that sums them (see the sketch below).
  3. Move to the RDD API and use aggregate or aggregateByKey (also sketched below).

Options 2 and 3 are both relatively inefficient (costing both CPU and memory).
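A minimal sketch of options 2 and 3, assuming the one-hot vectors are plain array columns of equal length (the DataFrame mirrors the example in the question; all names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("1234", [1, 0, 0]), ("1234", [0, 1, 0])],
    ["userid", "action"],
)

# Option 2: collect all vectors per user, then sum them element-wise in a UDF
sum_vectors = F.udf(
    lambda vectors: [sum(components) for components in zip(*vectors)],
    ArrayType(IntegerType()),
)
option2 = (
    df.groupBy("userid")
      .agg(sum_vectors(F.collect_list("action")).alias("action"))
)
option2.show()  # [1234, [1, 1, 0]]

# Option 3: drop to the RDD API and combine vectors per key with aggregateByKey
zero = [0, 0, 0]  # zero vector; assumes every vector has the same length
elementwise_add = lambda a, b: [x + y for x, y in zip(a, b)]
option3 = (
    df.rdd
      .map(lambda row: (row["userid"], row["action"]))
      .aggregateByKey(zero, elementwise_add, elementwise_add)
)
print(option3.collect())  # [('1234', [1, 1, 0])]
```

Note that if the column was produced by Spark ML's OneHotEncoder it holds Vector objects rather than arrays; in that case you would first convert them to arrays (e.g. with a UDF calling .toArray().tolist()) before applying either approach.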

answered Oct 17 '22 by Assaf Mendelson