This may be a novice question, but I can't work out whether there is any specific advantage to using QuantileDiscretizer
over Bucketizer
in Spark 2.1?
I understand that QuantileDiscretizer
is an estimator and handles NaN values, whereas Bucketizer
is a transformer and raises an error if the data contains NaN values.
From the Spark documentation, the code below produces similar output:
from pyspark.ml.feature import QuantileDiscretizer
from pyspark.ml.feature import Bucketizer
data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2)]
df = spark.createDataFrame(data, ["id", "hour"])
result_discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result").fit(df).transform(df)
result_discretizer.show()
splits = [-float("inf"), 3.0, 10.0, float("inf")]
result_bucketizer = Bucketizer(splits=splits, inputCol="hour", outputCol="result").transform(df)
result_bucketizer.show()
Output:
+---+----+------+
| id|hour|result|
+---+----+------+
| 0|18.0| 2.0|
| 1|19.0| 2.0|
| 2| 8.0| 1.0|
| 3| 5.0| 1.0|
| 4| 2.2| 0.0|
+---+----+------+
+---+----+------+
| id|hour|result|
+---+----+------+
| 0|18.0| 2.0|
| 1|19.0| 2.0|
| 2| 8.0| 1.0|
| 3| 5.0| 1.0|
| 4| 2.2| 0.0|
+---+----+------+
Please let me know if there is any significant advantage of one over the other.
Bucketizer is used to transform a column of continuous features into a column of feature buckets. We specify n+1 splits for mapping continuous features into n buckets. The splits must be in strictly increasing order. Typically, we add Double.NegativeInfinity and Double.PositiveInfinity as the first and last splits to cover the full range of values.
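The [x, y) bucket semantics can be sketched in plain Python (a simplified stand-in for Bucketizer's lookup logic, not Spark's actual implementation):

```python
from bisect import bisect_right

def bucketize(value, splits):
    """Map a value to a bucket index using Bucketizer-style
    [x, y) semantics; the last bucket also includes its upper bound."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the split range")
    if value == splits[-1]:          # last bucket is [x, y], inclusive of y
        return len(splits) - 2
    return bisect_right(splits, value) - 1

splits = [float("-inf"), 3.0, 10.0, float("inf")]
print([bucketize(h, splits) for h in [18.0, 19.0, 8.0, 5.0, 2.2]])
# → [2, 2, 1, 1, 0], matching the "result" column in the example above
```

A value equal to an interior split (e.g. 3.0 here) lands in the bucket that starts at it, which is what the half-open [x, y) ranges imply.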
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins is set with the numBuckets parameter. It is possible that the number of buckets actually used will be smaller than this value, for example if there are too few distinct values of the input to create enough distinct quantiles.
Bucketizer transforms a column of continuous features into a column of feature buckets, where the buckets are decided by the splits parameter. A bucket defined by the splits x, y holds values in the range [x, y), except the last bucket, which also includes y.
Both can map multiple columns at once by setting the inputCols parameter (since 2.3.0 for Bucketizer, since 3.0.0 for QuantileDiscretizer). If both the inputCol and inputCols parameters are set, an Exception is thrown.
QuantileDiscretizer determines the bucket splits based on the data. Bucketizer puts data into buckets that you specify via splits.
So use Bucketizer when you know the buckets you want, and QuantileDiscretizer to estimate the splits for you.
That the outputs are similar in the example is due to the contrived data and the splits chosen. Results may vary significantly in other scenarios.