
How to derive a percentile using a Spark DataFrame and groupBy in Python

I have a Spark dataframe which has Date, Group and Price columns.

I'm trying to derive the 0.6 percentile of the Price column of that dataframe in Python, and I also need to add the result as a new column.

I tried the code below:

perudf = udf(lambda x: x.quantile(.6))
df1 = df.withColumn("Percentile", df.groupBy("group").agg("group"),perudf('price'))

but it is throwing the following error:

assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
AssertionError: all exprs should be Column
asked Jan 07 '23 by Somashekar Muniyappa


1 Answer

You can use the built-in "percentile_approx" function through Spark SQL. A plain Python UDF runs row by row, so it cannot compute an aggregate like a percentile, which is why writing a UDF for this in PySpark is difficult.
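For example, here is a minimal sketch assuming a dataframe with the group and price columns from the question (the sample data is made up for illustration). It calls percentile_approx through a SQL expression inside a groupBy, then joins the per-group result back so every row gets its group's 0.6 percentile as a new column:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data with the columns from the question.
df = spark.createDataFrame(
    [("2023-01-01", "A", 10.0),
     ("2023-01-02", "A", 20.0),
     ("2023-01-03", "B", 30.0),
     ("2023-01-04", "B", 40.0)],
    ["date", "group", "price"],
)

# 0.6 percentile of price per group via the SQL percentile_approx function.
percentiles = df.groupBy("group").agg(
    F.expr("percentile_approx(price, 0.6)").alias("Percentile")
)

# Join back so each row carries its group's percentile as a new column.
df1 = df.join(percentiles, on="group", how="left")
df1.show()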

Refer to this link for more details: https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCALte62wQV68D6J87EVq6AD5-T3D0F3fHjuzs+1C5aCHOUUQS8w@mail.gmail.com%3E
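If you are on Spark 3.1 or later (an assumption about your version), percentile_approx is also exposed directly in pyspark.sql.functions and can be used over a window, which avoids the join:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Same result without a join: compute the per-group percentile over a window.
w = Window.partitionBy("group")
df1 = df.withColumn("Percentile", F.percentile_approx("price", 0.6).over(w))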

answered Jan 10 '23 by user3343061