I would like to calculate group quantiles on a Spark DataFrame (using PySpark). Either an approximate or exact result would be fine. I prefer a solution that I can use within the context of groupBy / agg, so that I can mix it with other PySpark aggregate functions. If this is not possible for some reason, a different approach would be fine as well.
This question is related but does not indicate how to use approxQuantile as an aggregate function.
I also have access to the percentile_approx Hive UDF, but I don't know how to use it as an aggregate function.
For the sake of specificity, suppose I have the following dataframe:
from pyspark import SparkContext
import pyspark.sql.functions as f

sc = SparkContext()
df = sc.parallelize([
    ['A', 1],
    ['A', 2],
    ['A', 3],
    ['B', 4],
    ['B', 5],
    ['B', 6],
]).toDF(('grp', 'val'))

# magic_percentile stands in for the aggregate function I'm looking for
df_grp = df.groupBy('grp').agg(f.magic_percentile('val', 0.5).alias('med_val'))
df_grp.show()
Expected result is:
+----+-------+
| grp|med_val|
+----+-------+
|   A|      2|
|   B|      5|
+----+-------+
I guess you don't need it anymore. But I'll leave it here for future generations (i.e. me next week when I forget).
from pyspark.sql import Window
import pyspark.sql.functions as F

# compute the group median as a window aggregate over each 'grp' partition
grp_window = Window.partitionBy('grp')
magic_percentile = F.expr('percentile_approx(val, 0.5)')

df.withColumn('med_val', magic_percentile.over(grp_window))
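Note that this window version keeps every input row and simply attaches the group's median as an extra column, rather than collapsing each group to a single row.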
Or, to address your question exactly, this also works:
df.groupBy('grp').agg(magic_percentile.alias('med_val'))
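Since the expression behaves like any other aggregate, you can mix it with the built-in aggregate functions, which was the original requirement. A minimal sketch (the avg column and aliases are my own):

# percentile_approx mixed with a built-in aggregate in a single agg() call
df.groupBy('grp').agg(
    magic_percentile.alias('med_val'),
    F.avg('val').alias('avg_val'),
)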
And as a bonus, you can pass an array of percentiles:
quantiles = F.expr('percentile_approx(val, array(0.25, 0.5, 0.75))')
And you'll get a list in return.
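For example, plugged into the same aggregation (a sketch; the column alias is mine):

quantiles = F.expr('percentile_approx(val, array(0.25, 0.5, 0.75))')
df.groupBy('grp').agg(quantiles.alias('quantiles'))
# each group's 'quantiles' column now holds an array of three values:
# the approximate 25th, 50th and 75th percentiles of 'val'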
Since you have access to percentile_approx, one simple solution would be to use it in a SQL command:
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
# register the DataFrame as a temporary table so it can be queried with SQL
df.registerTempTable("df")
df2 = sqlContext.sql("select grp, percentile_approx(val, 0.5) as med_val from df group by grp")
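As an aside, on Spark 2.0+ the same query can be run through a SparkSession, since SQLContext and registerTempTable are deprecated there. A sketch, assuming a SparkSession named spark:

# register the DataFrame and run the same query via SparkSession
df.createOrReplaceTempView("df")
df2 = spark.sql("select grp, percentile_approx(val, 0.5) as med_val from df group by grp")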