Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Calculate percentile on pyspark dataframe columns

I have a PySpark dataframe which contains an ID and then a couple of variables for which I want to calculate the 95% point.

Part of the printSchema():

root
 |-- ID: string (nullable = true)
 |-- MOU_G_EDUCATION_ADULT: double (nullable = false)
 |-- MOU_G_EDUCATION_KIDS: double (nullable = false)

I found How to derive Percentile using Spark Data frame and GroupBy in python, but this fails with an error message:

perc95_udf = udf(lambda x: x.quantile(.95))


fanscores = genres.withColumn("P95_MOU_G_EDUCATION_ADULT", perc95_udf('MOU_G_EDUCATION_ADULT')) \
                      .withColumn("P95_MOU_G_EDUCATION_KIDS", perc95_udf('MOU_G_EDUCATION_KIDS'))

fanscores.take(2) 

AttributeError: 'float' object has no attribute 'quantile'

Other UDF trials I already tried:

def percentile(quantiel,kolom):
    x=np.array(kolom)
    perc=np.percentile(x, quantiel)
    return perc

percentile_udf = udf(percentile, LongType())


fanscores = genres.withColumn("P95_MOU_G_EDUCATION_ADULT", percentile_udf(quantiel=95, kolom=genres.MOU_G_EDUCATION_ADULT)) \
                  .withColumn("P95_MOU_G_EDUCATION_KIDS", percentile_udf(quantiel=95, kolom=genres.MOU_G_EDUCATION_KIDS))

fanscores.take(2)   

gives the error: "TypeError: wrapper() got an unexpected keyword argument 'quantiel'"

My final trial:

import numpy as np

def percentile(quantiel):
    return udf(lambda kolom: np.percentile(np.array(kolom), quantiel))

fanscores = genres.withColumn("P95_MOU_G_EDUCATION_ADULT", percentile(quantiel=95)(genres.MOU_G_EDUCATION_ADULT)) \
                  .withColumn("P95_MOU_G_EDUCATION_KIDS", percentile(quantiel=95) (genres.MOU_G_EDUCATION_KIDS))

fanscores.take(2)  

Gives the error:

PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)

How could I solve this ?

like image 335
Wendy De Wit Avatar asked Sep 19 '18 12:09

Wendy De Wit


1 Answers

df.selectExpr('percentile(MOU_G_EDUCATION_ADULT, 0.95)').show()

For large datasets consider using percentile_approx()

like image 128
Michael West Avatar answered Sep 30 '22 01:09

Michael West