I have the following Spark dataframe :
agent_id|payment_amount|
+--------+--------------+
| a| 1000|
| b| 1100|
| a| 1100|
| a| 1200|
| b| 1200|
| b| 1250|
| a| 10000|
| b| 9000|
+--------+--------------+
my desire output would be something like
agen_id 95_quantile
a whatever is 95 quantile for agent a payments
b whatever is 95 quantile for agent b payments
for each group of agent_id I need to calculate the 0.95 quantile, I take the following approach:
test_df.groupby('agent_id').approxQuantile('payment_amount',0.95)
but I take the following error:
'GroupedData' object has no attribute 'approxQuantile'
I need to have .95 quantile(percentile) in a new column so later can be used for filtering purposes
I am using Spark 2.0.0
Percentiles are given as percent values, values such as 95%, 40%, or 27%. Quantiles are given as decimal values, values such as 0.95, 0.4, and 0.27. The 0.95 quantile point is exactly the same as the 95th percentile point.
percentile_approx. Returns the approximate percentile of the numeric column col which is the smallest value in the ordered col values (sorted from least to greatest) such that no more than percentage of col values is less than the value or equal to that value. The value of percentage must be between 0.0 and 1.0.
We can define our own UDF in PySpark, and then we can use the python library np. The numpy has the method that calculates the median of a data frame. The data frame column is first grouped by based on a column value and post grouping the column whose median needs to be calculated in collected as a list of Array.
One solution would be to use percentile_approx
:
>>> test_df.registerTempTable("df")
>>> df2 = sqlContext.sql("select agent_id, percentile_approx(payment_amount,0.95) as approxQuantile from df group by agent_id")
>>> df2.show()
# +--------+-----------------+
# |agent_id| approxQuantile|
# +--------+-----------------+
# | a|8239.999999999998|
# | b|7449.999999999998|
# +--------+-----------------+
Note 1 : This solution was tested with spark 1.6.2 and requires a HiveContext
.
Note 2 : approxQuantile
isn't available in Spark < 2.0 for pyspark
.
Note 3 : percentile
returns an approximate pth percentile of a numeric column (including floating point types) in the group. When the number of distinct values in col is smaller than second argument value, this gives an exact percentile value.
EDIT : From Spark 2+, HiveContext
is not required.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With