 

Custom aggregation on PySpark dataframes [duplicate]

I have a PySpark DataFrame with a column of one-hot encoded vectors. After a groupBy, I want to aggregate the vectors by element-wise addition.

e.g. df[userid, action]
Row1: ["1234", [1, 0, 0]]
Row2: ["1234", [0, 1, 0]]

I want the output as Row: ["1234", [1, 1, 0]], i.e. the vector is the element-wise sum of all vectors grouped by userid.

How can I achieve this? PySpark's built-in sum aggregate does not support vector addition.
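For reference, a minimal reproduction of the data described above, assuming the one-hot vectors are stored as plain array columns (the column names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two rows for the same user, each holding a one-hot encoded action vector
df = spark.createDataFrame(
    [("1234", [1, 0, 0]), ("1234", [0, 1, 0])],
    ["userid", "action"],
)

# Desired result after grouping by userid:
# +------+---------+
# |userid|   action|
# +------+---------+
# |  1234|[1, 1, 0]|
# +------+---------+
```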

asked Dec 07 '16 by user2242666


1 Answer

You have several options:

  1. Create a user defined aggregate function (UDAF). The problem is that you would need to write the UDAF in Scala and wrap it for use from Python.
  2. Use the collect_list function to collect all vectors into a list per group, then apply a UDF that sums them (see the sketch below).
  3. Move to the RDD API and use aggregate or aggregateByKey (also sketched below).

Options 2 and 3 are both relatively inefficient (costing both CPU and memory).
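A minimal sketch of options 2 and 3, assuming the one-hot vectors are plain array columns of equal length (the DataFrame mirrors the example in the question; all names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("1234", [1, 0, 0]), ("1234", [0, 1, 0])],
    ["userid", "action"],
)

# Option 2: collect all vectors per user, then sum them element-wise in a UDF
sum_vectors = F.udf(
    lambda vectors: [sum(components) for components in zip(*vectors)],
    ArrayType(IntegerType()),
)
option2 = (
    df.groupBy("userid")
      .agg(sum_vectors(F.collect_list("action")).alias("action"))
)
option2.show()  # [1234, [1, 1, 0]]

# Option 3: drop to the RDD API and combine vectors per key with aggregateByKey
zero = [0, 0, 0]  # zero vector; assumes every vector has the same length
elementwise_add = lambda a, b: [x + y for x, y in zip(a, b)]
option3 = (
    df.rdd
      .map(lambda row: (row["userid"], row["action"]))
      .aggregateByKey(zero, elementwise_add, elementwise_add)
)
print(option3.collect())  # [('1234', [1, 1, 0])]
```

Note that if the column was produced by Spark ML's OneHotEncoder it holds Vector objects rather than arrays; in that case you would first convert them to arrays (e.g. with a UDF calling .toArray().tolist()) before applying either approach.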

answered Oct 17 '22 by Assaf Mendelson