I have a data frame df structured like so:
Input
amount  id
13000   1
30000   2
10000   3
5000    4
I want to create a new column binned according to the quantiles of column 'amount'.
Expected Output:
amount  id  amount_bin
13000   1   10000
30000   2   15000
10000   3   10000
5000    4   5000
Assume the quantiles 0.25, 0.5 and 0.75 are 5000, 10000 and 15000 respectively.
I know how to do this in R:
quantile <- quantile(df$amount, probs = c(0, 0.25, 0.50, 0.75, 1.0), na.rm = TRUE,
names = FALSE)
df$amount_bin <- cut(df$amount, breaks = quantile, include.lowest = TRUE,
labels = c(quantile[2], quantile[3], quantile[4], quantile[5]))
Use pd.cut() for binning data based on the range of possible values, and pd.qcut() for binning based on the actual distribution of the values. (In Spark ML, Bucketizer and QuantileDiscretizer play the same two roles, as the answers below show.)
The QuantileDiscretizer works OK if your data is neatly distributed, but when you specify numBuckets it does not split the range of values in a column into equally sized bins; it computes approximate quantiles, so the buckets hold roughly equal numbers of rows. Nor are you able to select the boundaries of your bins.
The Bucketizer from Spark ML does have these features, however:
import org.apache.spark.ml.feature.Bucketizer
val data = Array(0.99, 0.64, 0.39, 0.44, 0.15, 0.05, 0.30, 0.31, 0.22, 0.45, 0.52, 0.26)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("continuousFeature")
val bucketizer = new Bucketizer()
.setInputCol("continuousFeature")
.setOutputCol("discretizedFeature")
.setSplits(Array(0.0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.0))
// the split values define the bin boundaries
val binnedData = bucketizer.transform(df)
binnedData.show
+-----------------+------------------+
|continuousFeature|discretizedFeature|
+-----------------+------------------+
| 0.99| 9.0|
| 0.64| 6.0|
| 0.39| 3.0|
| 0.44| 4.0|
| 0.15| 1.0|
| 0.05| 0.0|
| 0.3| 3.0|
| 0.31| 3.0|
| 0.22| 2.0|
| 0.45| 4.0|
| 0.52| 5.0|
| 0.26| 2.0|
+-----------------+------------------+
I think this is much nicer, as it gives you a lot more control over your result.
Note that the range of your splits needs to contain all of the values in your input column; otherwise you will have to set up rules for handling invalid input values using the setHandleInvalid method.
You do not need to specify regularly spaced bins as I have in this example.
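For example, a minimal sketch combining both notes (the split values and the name safeBucketizer are illustrative, and the exact invalid-value semantics of "keep" vary by Spark version, so check the Scaladoc below):

// Unbounded outer splits cover any out-of-range values; the interior
// splits only need to be strictly increasing, not regularly spaced.
// setHandleInvalid("keep") routes invalid values (e.g. NaN) to an extra
// bucket; "skip" drops such rows, and "error" (the default) throws.
val safeBucketizer = new Bucketizer()
  .setInputCol("continuousFeature")
  .setOutputCol("discretizedFeature")
  .setSplits(Array(Double.NegativeInfinity, 0.0, 0.25, 0.7, 1.0, Double.PositiveInfinity))
  .setHandleInvalid("keep")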
Scaladoc: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.Bucketizer
Another example: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/BucketizerExample.scala
You can use QuantileDiscretizer from the ML library.
Create buckets based on fitted quantiles:
import org.apache.spark.ml.feature.QuantileDiscretizer
val data = Array((13000, 1), (30000, 2), (10000, 3), (5000, 4))
val df = spark.createDataFrame(data).toDF("amount", "id")
val discretizer = new QuantileDiscretizer()
.setInputCol("amount")
.setOutputCol("result")
.setNumBuckets(4)
val result = discretizer.fit(df).transform(df)
result.show()
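Note that QuantileDiscretizer gives you bucket indices (0.0, 1.0, 2.0, ...) in result, not the quantile values your expected output uses as labels. A minimal sketch of one way to recover those labels (the names model, upperBound and maxAmount are mine, and substituting the column maximum for the top bucket's +Infinity boundary is a choice, not part of the API; assumes Spark 2.x, where fit() returns a Bucketizer):

import org.apache.spark.sql.functions.{col, max, udf}

val model  = discretizer.fit(df)   // the fitted model is a Bucketizer
val splits = model.getSplits       // e.g. Array(-Infinity, q25, q50, q75, Infinity)

// Label each row with the upper boundary of its bucket, mirroring the
// R cut() call in the question; the top bucket's boundary is +Infinity,
// so substitute the column maximum there.
val maxAmount  = df.agg(max("amount")).head.getInt(0).toDouble
val upperBound = udf((bucket: Double) => {
  val hi = splits(bucket.toInt + 1)
  if (hi.isPosInfinity) maxAmount else hi
})

model.transform(df)
  .withColumn("amount_bin", upperBound(col("result")))
  .show()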