I have a question about decision trees in <code>MLlib</code>. Which algorithm does Spark use: ID3, C4.5, or CART?


What algorithm is used in spark decision tree (is ID3, C4.5 or CART)

2 Answers

Spark MLlib uses an ID3-style tree-building procedure combined with CART.

ID3 handles only categorical variables, whereas CART can also handle continuous variables. Spark decision trees handle continuous variables as well as categorical ones, so Spark is using CART. (The Jira ticket mentioned below shows that C4.5 has not been implemented yet.)
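To illustrate what handling a continuous variable means in a CART-style tree, here is a minimal sketch in plain Python that picks a binary threshold split by minimizing weighted Gini impurity. The function names `gini` and `best_threshold` are hypothetical and chosen for this example. This is not Spark's code; Spark's actual implementation differs (for instance, it discretizes continuous features into bins before evaluating splits).

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Find the binary split `value <= threshold` with the lowest weighted Gini.

    Returns (threshold, weighted_gini). The threshold is None when no split
    improves on the impurity of the unsplit data.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate two equal values
        threshold = (low + high) / 2.0  # midpoint between neighbouring values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


if __name__ == "__main__":
    feature = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    classes = [0, 0, 0, 1, 1, 1]
    print(best_threshold(feature, classes))
```

A categorical-only algorithm such as ID3 would instead create one branch per category value, which is why it cannot split a numeric feature this way without discretizing it first.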

This blog post has some information about the different algorithms; it is where I got this answer from.

You can find a discussion on extending it to C4.5 in this Jira ticket.

More information about the differences between the algorithms is available here.


answered Oct 08 '22 05:10

Ignacio

If you look at the Apache Spark link, under the section

Node impurity and information gain (Basic Algorithm)

you will find:

The current implementation provides two impurity measures for classification (Gini impurity and entropy) and one impurity measure for regression (variance)

Also, if you look at the Decision Tree link, you will find that the CART (classification and regression tree) algorithm uses Gini impurity and entropy for classification and variance reduction for regression. These are the same measures Spark provides, which suggests that Spark's decision tree is CART.
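For reference, the three impurity measures named above can be sketched in a few lines of plain Python. This only illustrates the textbook formulas, not Spark's implementation; the function names are hypothetical, and the base-2 logarithm for entropy is an assumption made for this example.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: sum over classes of p * (1 - p)."""
    n = len(labels)
    return sum((c / n) * (1.0 - c / n) for c in Counter(labels).values())


def entropy(labels):
    """Entropy: sum over classes of -p * log2(p) (base 2 is assumed here)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def variance(values):
    """Variance: mean squared deviation from the mean, used for regression."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


if __name__ == "__main__":
    labels = [0, 0, 1, 1]          # evenly mixed two-class node
    targets = [1.0, 2.0, 3.0]      # regression targets in a node
    print(gini_impurity(labels), entropy(labels), variance(targets))
```

A tree builder picks, at each node, the split that reduces the chosen measure the most: Gini or entropy for classification labels, variance for regression targets.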

answered Oct 08 '22 04:10

John Doe

