I would like to group my Spark DataFrame by a column and apply a custom aggregation function:
def gini(list_of_values):
    # some processing happens here
    return number_output
I would like to get something like this:

df.groupby('activity')['mean_event_duration_in_hours'].agg(gini)

Could you please help me solve this?
You can create a udf like so:
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

def gini(list_of_values):
    # your computation here; must return a Python float to match FloatType()
    return number_output

udf_gini = F.udf(gini, FloatType())

df.groupby('activity') \
    .agg(F.collect_list("mean_event_duration_in_hours").alias("event_duration_list")) \
    .withColumn("gini", udf_gini(F.col("event_duration_list")))

Note that there is no built-in way to pass a custom Python function directly to .agg(), so the trick is to collect each group's values into a list with F.collect_list and then apply the udf to that list column.
Or define gini as a UDF with the decorator syntax (referencing the decorator as F.udf, since only the module was imported above):

@F.udf(returnType=FloatType())
def gini(list_of_values):
    # your computation here; must return a Python float
    return number_output

df.groupby('activity') \
    .agg(F.collect_list("mean_event_duration_in_hours").alias("event_duration_list")) \
    .withColumn("gini", gini(F.col("event_duration_list")))