
What algorithm is used in Spark decision trees (ID3, C4.5 or CART)?

I have a question about decision trees in MLlib. Which algorithm does Spark use: ID3, C4.5 or CART?

zhuangxue asked Dec 07 '16 08:12


People also ask

What are ID3, C4.5 and CART?

ID3, CART and C4.5 are among the most common decision tree algorithms in data mining. They use different splitting criteria to split the node at each level, aiming to form homogeneous nodes (i.e. nodes that contain objects belonging to the same category).

Is C4.5 a decision tree algorithm?

The C4.5 algorithm is used in data mining as a decision tree classifier, which can be employed to generate a decision based on a certain sample of data (univariate or multivariate predictors).

Is ID3 a decision tree algorithm?

In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used in the machine learning and natural language processing domains.

Which is better: ID3, C4.5 or CART?

CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it supports numerical target variables (regression) and does not compute rule sets. CART constructs binary trees using the feature and threshold that yield the largest information gain at each node.


2 Answers

Spark MLlib is using the ID3 algorithm with CART.

ID3 only handles categorical variables, while CART can also handle continuous variables. Spark decision trees can handle continuous variables, so it is using CART (in the Jira ticket mentioned below we can see that they haven't implemented C4.5 yet).
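To illustrate why handling continuous variables points to a CART-style tree, here is a minimal sketch of a binary threshold split on one continuous feature. It is not Spark's code: the function names `gini` and `best_binary_split` are made up for this example, and it tries every midpoint between sorted values, whereas Spark's documentation describes binning continuous features into a limited set of candidate splits.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(values, labels):
    """Find the threshold t that minimises the weighted Gini impurity of the
    two children produced by the test `value <= t`.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns (threshold, weighted_impurity); threshold is None if no split helps.
    """
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    n = len(pairs)
    best_threshold, best_impurity = None, gini(labels)
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best_impurity:
            best_threshold, best_impurity = (low + high) / 2, weighted
    return best_threshold, best_impurity


if __name__ == "__main__":
    ages = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    classes = ["a", "a", "a", "b", "b", "b"]
    print(best_binary_split(ages, classes))
```

Each split here is a yes/no test on a numeric threshold, so the resulting tree is binary. Classic ID3 instead creates one branch per category value and has no notion of a threshold.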

This blog post has some information about the different algorithms, and it is where I got the answer from.

You can find a discussion on extending it to C4.5 in this Jira ticket.

More information about the difference between the algorithms here.

Ignacio answered Oct 08 '22 05:10


If you take a look at the linked Apache Spark page, in the section

Node impurity and information gain (Basic Algorithm)

you can find:

The current implementation provides two impurity measures for classification (Gini impurity and entropy) and one impurity measure for regression (variance).

Also, if you take a look at the linked Decision Tree page, you can see that the CART (classification and regression tree) algorithm uses Gini impurity and entropy for classification and variance reduction for regression.
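The three impurity measures named above, and the information gain built on them, can be sketched in plain Python. This is only an illustration of the formulas, not Spark's implementation: the function names are made up for this example, and the base-2 logarithm for entropy is an assumption of the sketch.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class frequencies p_i."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy: -sum(p_i * log2(p_i)) over the class frequencies p_i."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def variance(values):
    """Population variance of numeric targets (impurity for regression)."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def information_gain(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    return (impurity(parent)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))


if __name__ == "__main__":
    parent = ["a", "a", "b", "b"]
    print(gini(parent), entropy(parent))
    print(information_gain(parent, ["a", "a"], ["b", "b"], gini))
    print(variance([1.0, 2.0, 3.0, 4.0]))
```

The same `information_gain` function works for classification (with `gini` or `entropy`) and for regression (with `variance`), which matches the answer's point that one tree procedure covers both cases.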

John Doe answered Oct 08 '22 04:10