
What are the DecisionTree.trainClassifier parameters in Spark?

I'm studying Spark MLlib. While reading about DecisionTree, I came across the following DecisionTree.trainClassifier usage example:

import org.apache.spark.mllib.tree._
val model = DecisionTree.trainClassifier(trainData, 7, Map[Int, Int](), "gini", 4, 100)

There are six parameters here. I don't understand the 3rd (the Map), the 5th (4), and the 6th (100).

A Google search suggests they are the categorical features, lambda, and alpha. Can anyone explain them a bit better?

Any help would be appreciated.

Jin Park asked Sep 12 '25 22:09



1 Answer

3rd:
categoricalFeaturesInfo: a map from feature index to the number of categories (arity) of that feature. Any features not in this map are treated as continuous.

For example, Map(0 -> 2, 4 -> 10) specifies that feature 0 is binary (taking values 0 or 1) and that feature 4 has 10 categories (values {0, 1, ..., 9}). Note that feature indices are 0-based: features 0 and 4 are the 1st and 5th elements of an instance’s feature vector.

An empty map, Map[Int, Int](), as in your example, means that no feature is categorical, so all features are treated as continuous (numeric).
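As a rough sketch of where the map sits in the call, the snippet below passes each argument by name. The toy data, the app name, the local master, and the choice of two classes are assumptions made up for illustration and are not from the question; it also assumes Spark's RDD-based MLlib is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

object TrainClassifierSketch {
  def main(args: Array[String]): Unit = {
    // Local context for experimentation only (assumed setup).
    val conf = new SparkConf().setAppName("dt-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Made-up training data: feature 0 is categorical with values {0, 1},
    // feature 1 is continuous.
    val trainData = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0, 1.5)),
      LabeledPoint(0.0, Vectors.dense(0.0, 2.1)),
      LabeledPoint(1.0, Vectors.dense(1.0, 3.2)),
      LabeledPoint(1.0, Vectors.dense(1.0, 4.8)),
      LabeledPoint(0.0, Vectors.dense(1.0, 0.7)),
      LabeledPoint(1.0, Vectors.dense(0.0, 5.3))
    ))

    val model = DecisionTree.trainClassifier(
      input = trainData,
      numClasses = 2,                          // labels are 0.0 and 1.0
      categoricalFeaturesInfo = Map(0 -> 2),   // feature 0 has 2 categories; feature 1 is continuous
      impurity = "gini",
      maxDepth = 4,
      maxBins = 100)

    // Print the learned tree structure.
    println(model.toDebugString)
    sc.stop()
  }
}
```

Writing the arguments by name makes it easier to see that the 3rd, 5th and 6th positions are categoricalFeaturesInfo, maxDepth and maxBins.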

5th:
maxDepth: the maximum depth of the tree. For example, depth 0 means a single leaf node, and depth 1 means one internal node plus two leaf nodes.

6th:
maxBins: Number of bins used when discretizing continuous features.

Increasing maxBins allows the algorithm to consider more split candidates and make fine-grained split decisions. However, it also increases computation and communication.

Note that the maxBins parameter must be at least the maximum number of categories of any categorical feature.
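The constraint above can be checked before training. Here is a minimal sketch in plain Scala; the object and method names (MaxBinsCheck, minRequiredMaxBins, isValid) are hypothetical helpers written for this answer and are not part of Spark.

```scala
object MaxBinsCheck {
  // Largest arity declared in categoricalFeaturesInfo,
  // or 0 when the map is empty (all features continuous).
  def minRequiredMaxBins(categoricalFeaturesInfo: Map[Int, Int]): Int =
    if (categoricalFeaturesInfo.isEmpty) 0 else categoricalFeaturesInfo.values.max

  // True when maxBins is at least as large as the largest categorical arity.
  def isValid(categoricalFeaturesInfo: Map[Int, Int], maxBins: Int): Boolean =
    maxBins >= minRequiredMaxBins(categoricalFeaturesInfo)
}
```

With the map from the earlier example, Map(0 -> 2, 4 -> 10), any maxBins of 10 or more passes this check, so the value 100 from the question is fine.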

For more details, you can refer to the book "Advanced Analytics with Spark" (chapter 4.8-4.10).

Song answered Sep 15 '25 18:09
