 

Calculate quantile on grouped data in spark Dataframe

I have the following Spark DataFrame:

+--------+--------------+
|agent_id|payment_amount|
+--------+--------------+
|       a|          1000|
|       b|          1100|
|       a|          1100|
|       a|          1200|
|       b|          1200|
|       b|          1250|
|       a|         10000|
|       b|          9000|
+--------+--------------+

My desired output would be something like:

agent_id   95_quantile
  a          whatever the 0.95 quantile is for agent a's payments
  b          whatever the 0.95 quantile is for agent b's payments

For each group of agent_id I need to calculate the 0.95 quantile, so I took the following approach:

test_df.groupby('agent_id').approxQuantile('payment_amount',0.95)

but I get the following error:

'GroupedData' object has no attribute 'approxQuantile'

I need the 0.95 quantile (percentile) in a new column so it can later be used for filtering.

I am using Spark 2.0.0

asked Sep 22 '16 by chessosapiens


People also ask

What is the difference between quantile and percentile?

Percentiles are given as percent values, values such as 95%, 40%, or 27%. Quantiles are given as decimal values, values such as 0.95, 0.4, and 0.27. The 0.95 quantile point is exactly the same as the 95th percentile point.
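That equivalence can be checked with a small pure-Python sketch, using the question's payment values for agent a. The `quantile` helper below is illustrative only (linear interpolation between sorted values, the same scheme NumPy uses by default) and is not part of Spark:

```python
def quantile(values, q):
    """Return the q-th quantile (0 <= q <= 1) via linear interpolation."""
    s = sorted(values)
    pos = q * (len(s) - 1)          # fractional index into the sorted list
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

payments = [1000, 1100, 1200, 10000]

# the 0.95 quantile and the 95th percentile are the same point
assert quantile(payments, 0.95) == quantile(payments, 95 / 100)
print(quantile(payments, 0.95))  # ~ 8680 for this sample
```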

How do you find the percentile of a column in PySpark?

percentile_approx. Returns the approximate percentile of the numeric column col: the smallest value in the ordered col values (sorted from least to greatest) such that no more than percentage of col values is less than or equal to that value. The value of percentage must be between 0.0 and 1.0.

How do you find the median of a PySpark DataFrame?

We can define our own UDF in PySpark and use a Python library such as NumPy, which has a method that calculates the median. The DataFrame is first grouped by a column value, and after grouping, the column whose median needs to be calculated is collected as a list, which the UDF then aggregates.
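The group-then-aggregate idea described above can be sketched in plain Python on the question's data, using `statistics.median` from the standard library in place of NumPy (the rows and structure below mirror the question's DataFrame; nothing here is Spark-specific):

```python
from collections import defaultdict
from statistics import median

# the question's data as (agent_id, payment_amount) rows
rows = [("a", 1000), ("b", 1100), ("a", 1100), ("a", 1200),
        ("b", 1200), ("b", 1250), ("a", 10000), ("b", 9000)]

# group rows by agent, collecting each agent's payments into a list
groups = defaultdict(list)
for agent, amount in rows:
    groups[agent].append(amount)

# aggregate each collected list with the median
medians = {agent: median(vals) for agent, vals in groups.items()}
print(medians)  # {'a': 1150.0, 'b': 1225.0}
```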


1 Answer

One solution would be to use percentile_approx:

>>> test_df.registerTempTable("df")
>>> df2 = sqlContext.sql("select agent_id, percentile_approx(payment_amount,0.95) as approxQuantile from df group by agent_id")

>>> df2.show()
# +--------+-----------------+
# |agent_id|   approxQuantile|
# +--------+-----------------+
# |       a|8239.999999999998|
# |       b|7449.999999999998|
# +--------+-----------------+ 

Note 1 : This solution was tested with spark 1.6.2 and requires a HiveContext.

Note 2 : approxQuantile isn't available in Spark < 2.0 for pyspark.

Note 3 : percentile_approx returns an approximate pth percentile of a numeric column (including floating point types) in the group. When the number of distinct values in col is smaller than the second argument (the accuracy, default 10000), it gives an exact percentile value.

EDIT : From Spark 2+, HiveContext is not required.
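For readers without a Spark session at hand, the grouped 0.95-quantile query above can be emulated in plain Python. This is a sketch only: the `quantile` helper uses linear interpolation, whereas Hive's percentile_approx builds an approximate histogram, which is why the answer's numbers (8239.99…, 7449.99…) differ from the interpolated values below:

```python
from collections import defaultdict

def quantile(values, q):
    """q-th quantile via linear interpolation between sorted values."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

# the question's data as (agent_id, payment_amount) rows
rows = [("a", 1000), ("b", 1100), ("a", 1100), ("a", 1200),
        ("b", 1200), ("b", 1250), ("a", 10000), ("b", 9000)]

# group by agent_id, then take each group's 0.95 quantile
groups = defaultdict(list)
for agent, amount in rows:
    groups[agent].append(amount)

for agent in sorted(groups):
    print(agent, quantile(groups[agent], 0.95))  # a ~ 8680, b ~ 7837.5
```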

answered Sep 24 '22 by eliasah