
Difference between QuantileDiscretizer and Bucketizer in Spark

This may be a novice question, but I can't tell whether there is any specific advantage to using QuantileDiscretizer over Bucketizer in Spark 2.1.

I understand that QuantileDiscretizer is an estimator that handles NaN values, whereas Bucketizer is a transformer that raises an error if the data contains NaN values.

The following code, adapted from the Spark documentation, produces the same output for both:

from pyspark.ml.feature import Bucketizer, QuantileDiscretizer
from pyspark.sql import SparkSession

# `spark` is predefined in the pyspark shell; create it when running as a script.
spark = SparkSession.builder.getOrCreate()

data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2)]
df = spark.createDataFrame(data, ["id", "hour"])

# Estimator: fit() derives the splits from the data.
result_discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result").fit(df).transform(df)
result_discretizer.show()

# Transformer: the splits are supplied by hand.
splits = [-float("inf"), 3, 10, float("inf")]
result_bucketizer = Bucketizer(splits=splits, inputCol="hour", outputCol="result").transform(df)
result_bucketizer.show()

Output:

+---+----+------+
| id|hour|result|
+---+----+------+
|  0|18.0|   2.0|
|  1|19.0|   2.0|
|  2| 8.0|   1.0|
|  3| 5.0|   1.0|
|  4| 2.2|   0.0|
+---+----+------+

+---+----+------+
| id|hour|result|
+---+----+------+
|  0|18.0|   2.0|
|  1|19.0|   2.0|
|  2| 8.0|   1.0|
|  3| 5.0|   1.0|
|  4| 2.2|   0.0|
+---+----+------+

Is there any significant advantage of one over the other?

Nim J asked Apr 13 '17 07:04



1 Answer

QuantileDiscretizer determines the bucket splits based on the data.

Bucketizer puts data into buckets that you specify via splits.

So use Bucketizer when you know the buckets you want, and QuantileDiscretizer to estimate the splits for you.

The outputs match in the example only because of the contrived data and the hand-picked splits. With other data or other splits, the results can differ significantly.
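
To make the distinction concrete, here is a minimal pure-Python sketch of the two behaviours. It is not Spark's implementation: the helper names `bucketize` and `quantile_splits` are hypothetical, exact quantiles stand in for the approximate quantiles that QuantileDiscretizer computes, and rejecting NaN outright is a simplification. The first data set is the one from the question; the second, skewed one is made up for illustration.

```python
import math
from bisect import bisect_right


def bucketize(values, splits):
    """Map each value to a bucket index using fixed, caller-supplied splits.

    Bucket i covers [splits[i], splits[i + 1]); the last bucket also
    includes its upper bound.
    """
    last_bucket = len(splits) - 2
    result = []
    for v in values:
        # Simplification for this sketch: NaN and out-of-range values are errors.
        if math.isnan(v) or v < splits[0] or v > splits[-1]:
            raise ValueError(f"value {v} is outside the splits")
        idx = bisect_right(splits, v) - 1
        result.append(float(min(idx, last_bucket)))
    return result


def quantile_splits(values, num_buckets):
    """Estimate splits from the data so buckets hold roughly equal counts.

    Uses exact quantiles of the sorted values (an assumption of this sketch).
    Duplicate cut points are merged, so fewer buckets may come back.
    """
    ordered = sorted(values)
    n = len(ordered)
    inner = {ordered[min(n - 1, (n * k) // num_buckets)] for k in range(1, num_buckets)}
    return [float("-inf")] + sorted(inner) + [float("inf")]


fixed = [float("-inf"), 3, 10, float("inf")]

# Data from the question: both approaches agree.
hours = [18.0, 19.0, 8.0, 5.0, 2.2]
print(bucketize(hours, fixed))                      # [2.0, 2.0, 1.0, 1.0, 0.0]
print(bucketize(hours, quantile_splits(hours, 3)))  # [2.0, 2.0, 1.0, 1.0, 0.0]

# Skewed, made-up data: the same fixed splits leave bucket 1 empty,
# while the data-derived splits spread the rows evenly.
skewed = [1.0, 1.5, 2.0, 2.5, 20.0, 21.0]
print(bucketize(skewed, fixed))                       # [0.0, 0.0, 0.0, 0.0, 2.0, 2.0]
print(bucketize(skewed, quantile_splits(skewed, 3)))  # [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
```

In this sketch the estimated splits are only used to build an ordinary fixed-split bucketing, which is the same division of labour the answer describes: one component learns the boundaries, the other applies them.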

ImDarrenG answered Oct 12 '22 13:10