This may be a novice question, but I can't work out whether there is any specific advantage to using QuantileDiscretizer
over Bucketizer
in Spark 2.1?
I understand that QuantileDiscretizer
is an estimator and handles NaN values, whereas Bucketizer
is a transformer and raises an error if the data contains NaN values.
From the Spark documentation, the code below produces similar output:
from pyspark.ml.feature import QuantileDiscretizer
from pyspark.ml.feature import Bucketizer
data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2)]
df = spark.createDataFrame(data, ["id", "hour"])
result_discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result").fit(df).transform(df)
result_discretizer.show()
splits = [-float("inf"), 3.0, 10.0, float("inf")]
result_bucketizer = Bucketizer(splits=splits, inputCol="hour", outputCol="result").transform(df)
result_bucketizer.show()
Output:
+---+----+------+
| id|hour|result|
+---+----+------+
| 0|18.0| 2.0|
| 1|19.0| 2.0|
| 2| 8.0| 1.0|
| 3| 5.0| 1.0|
| 4| 2.2| 0.0|
+---+----+------+
+---+----+------+
| id|hour|result|
+---+----+------+
| 0|18.0| 2.0|
| 1|19.0| 2.0|
| 2| 8.0| 1.0|
| 3| 5.0| 1.0|
| 4| 2.2| 0.0|
+---+----+------+
Please let me know if there is any significant advantage of one over the other.
Bucketizer is used to transform a column of continuous features into a column of feature buckets. We specify n+1 splits for mapping continuous features into n buckets. The splits must be in strictly increasing order. Typically, we add Double.NegativeInfinity and Double.PositiveInfinity as the first and last splits to cover the full range of values.
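The [x, y) bucket semantics can be sketched in plain Python (a simplified stand-in for Bucketizer's lookup logic, not Spark's actual implementation):

```python
from bisect import bisect_right

def bucketize(value, splits):
    """Map a value to a bucket index using Bucketizer-style
    [x, y) semantics; the last bucket also includes its upper bound."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the split range")
    if value == splits[-1]:          # last bucket is [x, y], inclusive of y
        return len(splits) - 2
    return bisect_right(splits, value) - 1

splits = [float("-inf"), 3.0, 10.0, float("inf")]
print([bucketize(h, splits) for h in [18.0, 19.0, 8.0, 5.0, 2.2]])
# → [2, 2, 1, 1, 0], matching the "result" column in the example above
```

A value equal to an interior split (e.g. 3.0 here) lands in the bucket that starts at it, which is what the half-open [x, y) ranges imply.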
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins is set with the numBuckets parameter. It is possible that the number of buckets actually used will be smaller than this value, for example if there are too few distinct values of the input to create enough distinct quantiles.
Bucketizer transforms a column of continuous features into a column of feature buckets, where the buckets are decided by the splits parameter. A bucket defined by the splits x, y holds values in the range [x, y), except the last bucket, which also includes y.
Both can map multiple columns at once by setting the inputCols parameter (since 2.3.0 for Bucketizer, since 3.0.0 for QuantileDiscretizer). If both the inputCol and inputCols parameters are set, an Exception is thrown.
QuantileDiscretizer determines the bucket splits based on the data. Bucketizer puts data into buckets that you specify via splits.
So use Bucketizer when you know the buckets you want, and QuantileDiscretizer to estimate the splits for you.
That the outputs are similar in the example is due to the contrived data and the splits chosen. Results may vary significantly in other scenarios.