How to create bins on a column in a data frame

I have a data frame df with structure like so:

Input

amount id
13000  1
30000  2
10000  3
5000   4

I want to create a new column based on the quantiles of the 'amount' column.

Expected Output:

amount id amount_bin
13000  1  10000
30000  2  15000
10000  3  10000
5000   4  5000

Assume the quantiles at 0.25, 0.5 and 0.75 are 5000, 10000 and 15000 respectively.

I know how to do this in R:

quantile <- quantile(df$amount, probs = c(0, 0.25, 0.50, 0.75, 1.0), na.rm = TRUE, 
                     names = FALSE)

df$amount_bin <- cut(df$amount, breaks = quantile, include.lowest = TRUE, 
                     labels = c(quantile[2], quantile[3], quantile[4], quantile[5]))
asked Apr 26 '17 by Anubhav Dikshit

People also ask

How do you split data into bins in Python?

Use pd.cut() for binning data based on the range of possible values, and pd.qcut() for binning data based on the actual distribution of values.


2 Answers

The QuantileDiscretizer works well enough if your data is neatly distributed. However, when you specify numBuckets, it does not split the range of values in a column into equally sized bins; it uses an approximate-quantile heuristic instead. Nor can you select the boundaries of your bins.

The Bucketizer from Spark ML does have these features, however:

import org.apache.spark.ml.feature.Bucketizer

val data = Array(0.99, 0.64, 0.39, 0.44, 0.15, 0.05, 0.30, 0.31, 0.22, 0.45, 0.52, 0.26)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("continuousFeature")

val bucketizer = new Bucketizer()
    .setInputCol("continuousFeature")
    .setOutputCol("discretizedFeature")
    .setSplits(Array(0.0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.0))

// the array of split values defines the bin boundaries

val binnedData = bucketizer.transform(df)

binnedData.show

+-----------------+------------------+
|continuousFeature|discretizedFeature|
+-----------------+------------------+
|             0.99|               9.0|
|             0.64|               6.0|
|             0.39|               3.0|
|             0.44|               4.0|
|             0.15|               1.0|
|             0.05|               0.0|
|              0.3|               3.0|
|             0.31|               3.0|
|             0.22|               2.0|
|             0.45|               4.0|
|             0.52|               5.0|
|             0.26|               2.0|
+-----------------+------------------+

This, I think, is much nicer: it gives you far more control over the result.

Note that the range of your splits must contain all of the values in your input column; otherwise you will have to set up rules for handling invalid input values using the setHandleInvalid method.

You do not need to specify regularly spaced bins as I have in this example.
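For intuition, Bucketizer's assignment rule can be sketched outside Spark: each value maps to the index of the half-open interval [splits(i), splits(i+1)) that contains it, with the final interval closed on both ends. This is a minimal plain-Scala version of that rule, not Spark's actual implementation (which uses a binary search):

```scala
// Plain-Scala sketch of Bucketizer's interval rule; illustrative only
def bucketIndex(value: Double, splits: Array[Double]): Double = {
  if (value == splits.last) (splits.length - 2).toDouble  // last interval is closed
  else {
    val i = splits.indices.dropRight(1)
      .find(i => value >= splits(i) && value < splits(i + 1))
      .getOrElse(sys.error(s"$value is outside the splits range"))
    i.toDouble
  }
}

val splits = Array(0.0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.0)
// e.g. 0.39 falls in [0.30, 0.40) and so is assigned bucket 3.0,
// matching the discretizedFeature column in the table above
```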

Scaladoc https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.Bucketizer

Another example https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/BucketizerExample.scala

answered Sep 27 '22 by Wade Jensen

You can use QuantileDiscretizer from the Spark ML library.

Create buckets based on fitted quantiles:

import org.apache.spark.ml.feature.QuantileDiscretizer

val data = Array((13000, 1), (30000, 2), (10000, 3), (5000, 4))
val df = spark.createDataFrame(data).toDF("amount", "id")

val discretizer = new QuantileDiscretizer()
  .setInputCol("amount")
  .setOutputCol("result")
  .setNumBuckets(4)

val result = discretizer.fit(df).transform(df)
result.show()
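Under the hood, QuantileDiscretizer fits approximate quantiles of the input column and hands the resulting split points to a Bucketizer. Roughly equivalent logic can be sketched in plain Scala with exact quantiles (Spark's approxQuantile is approximate, so its bucket edges may differ slightly; the nearest-rank quantile below is an illustrative simplification):

```scala
// Sketch: derive 4-bucket splits from exact quantiles, then assign buckets.
val amounts = Array(13000.0, 30000.0, 10000.0, 5000.0)
val sorted = amounts.sorted

// Exact empirical quantile, nearest-rank style (not Spark's algorithm)
def quantile(p: Double): Double =
  sorted(math.min(sorted.length - 1, (p * sorted.length).toInt))

// Outer splits are infinite so every value falls inside some bucket
val splits = Array(Double.NegativeInfinity, quantile(0.25), quantile(0.5),
                   quantile(0.75), Double.PositiveInfinity)

// Same interval rule as Bucketizer: value goes to [splits(i), splits(i+1))
def bucket(v: Double): Double =
  splits.indices.dropRight(1)
    .find(i => v >= splits(i) && v < splits(i + 1))
    .get.toDouble

val binned = amounts.map(bucket)  // one of four bucket indices per amount
```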
answered Sep 27 '22 by ImDarrenG