How to create bins on a column in a data frame

I have a data frame df with structure like so:

Input

amount id
13000  1
30000  2
10000  3
5000   4

I want to create a new column based on the quantiles of the 'amount' column.

Expected Output:

amount id amount_bin
13000  1  10000
30000  2  15000
10000  3  10000
5000   4  5000

Assume the quantiles at 0.25, 0.5 and 0.75 are 5000, 10000 and 15000 respectively.

I know how to do this in R:

quantile <- quantile(df$amount, probs = c(0, 0.25, 0.50, 0.75, 1.0), na.rm = TRUE, 
                     names = FALSE)

df$amount_bin <- cut(df$amount, breaks = quantile, include.lowest = TRUE, 
                     labels = c(quantile[2], quantile[3], quantile[4], quantile[5]))
asked Apr 26 '17 by Anubhav Dikshit

People also ask

How do you split data into bins in Python?

Use pd.cut() for binning data based on the range of possible values, and pd.qcut() for binning data based on the actual distribution of values.


2 Answers

The QuantileDiscretizer works well enough if your data is neatly distributed. However, when you specify numBuckets, it does not split the range of values in a column into equally sized bins; it uses an approximate-quantile heuristic instead. Nor can you select the boundaries of your bins.

The Bucketizer from Spark ML does have these features, however:

import org.apache.spark.ml.feature.Bucketizer

val data = Array(0.99, 0.64, 0.39, 0.44, 0.15, 0.05, 0.30, 0.31, 0.22, 0.45, 0.52, 0.26)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("continuousFeature")

val bucketizer = new Bucketizer()
    .setInputCol("continuousFeature")
    .setOutputCol("discretizedFeature")
    .setSplits(Array(0.0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.0))

// the array of split values defines the bin boundaries

val binnedData = bucketizer.transform(df)

binnedData.show

+-----------------+------------------+
|continuousFeature|discretizedFeature|
+-----------------+------------------+
|             0.99|               9.0|
|             0.64|               6.0|
|             0.39|               3.0|
|             0.44|               4.0|
|             0.15|               1.0|
|             0.05|               0.0|
|              0.3|               3.0|
|             0.31|               3.0|
|             0.22|               2.0|
|             0.45|               4.0|
|             0.52|               5.0|
|             0.26|               2.0|
+-----------------+------------------+

This, I think, is much nicer: it gives you far more control over the result.

Note that the range of your splits must contain all of the values in your input column; otherwise you will have to set up rules for handling invalid input values using the setHandleInvalid method.

You do not need to specify regularly spaced bins as I have in this example.
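For intuition, Bucketizer's assignment rule can be sketched outside Spark: each value maps to the index of the half-open interval [splits(i), splits(i+1)) that contains it, with the final interval closed on both ends. This is a minimal plain-Scala version of that rule, not Spark's actual implementation (which uses a binary search):

```scala
// Plain-Scala sketch of Bucketizer's interval rule; illustrative only
def bucketIndex(value: Double, splits: Array[Double]): Double = {
  if (value == splits.last) (splits.length - 2).toDouble  // last interval is closed
  else {
    val i = splits.indices.dropRight(1)
      .find(i => value >= splits(i) && value < splits(i + 1))
      .getOrElse(sys.error(s"$value is outside the splits range"))
    i.toDouble
  }
}

val splits = Array(0.0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.0)
// e.g. 0.39 falls in [0.30, 0.40) and so is assigned bucket 3.0,
// matching the discretizedFeature column in the table above
```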

Scaladoc https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.Bucketizer

Another example https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/BucketizerExample.scala

answered Sep 27 '22 by Wade Jensen

You can use QuantileDiscretizer from the Spark ML library.

Create buckets based on fitted quantiles:

import org.apache.spark.ml.feature.QuantileDiscretizer

val data = Array((13000, 1), (30000, 2), (10000, 3), (5000, 4))
val df = spark.createDataFrame(data).toDF("amount", "id")

val discretizer = new QuantileDiscretizer()
  .setInputCol("amount")
  .setOutputCol("result")
  .setNumBuckets(4)

val result = discretizer.fit(df).transform(df)
result.show()
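Under the hood, QuantileDiscretizer fits approximate quantiles of the input column and hands the resulting split points to a Bucketizer. Roughly equivalent logic can be sketched in plain Scala with exact quantiles (Spark's approxQuantile is approximate, so its bucket edges may differ slightly; the nearest-rank quantile below is an illustrative simplification):

```scala
// Sketch: derive 4-bucket splits from exact quantiles, then assign buckets.
val amounts = Array(13000.0, 30000.0, 10000.0, 5000.0)
val sorted = amounts.sorted

// Exact empirical quantile, nearest-rank style (not Spark's algorithm)
def quantile(p: Double): Double =
  sorted(math.min(sorted.length - 1, (p * sorted.length).toInt))

// Outer splits are infinite so every value falls inside some bucket
val splits = Array(Double.NegativeInfinity, quantile(0.25), quantile(0.5),
                   quantile(0.75), Double.PositiveInfinity)

// Same interval rule as Bucketizer: value goes to [splits(i), splits(i+1))
def bucket(v: Double): Double =
  splits.indices.dropRight(1)
    .find(i => v >= splits(i) && v < splits(i + 1))
    .get.toDouble

val binned = amounts.map(bucket)  // one of four bucket indices per amount
```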
answered Sep 27 '22 by ImDarrenG