What are DecisionTree.trainClassifier parameters in Spark

Question

I'm studying Spark MLlib. While studying DecisionTree, I came across the following DecisionTree.trainClassifier usage example.

import org.apache.spark.mllib.tree._
val model = DecisionTree.trainClassifier(trainData, 7, Map[Int, Int](), "gini", 4, 100)

There are 6 parameters here. I don't understand the 3rd (Map), 5th (4), and 6th (100) parameters.

Google says they are categorical features, lambda, and alpha. Can anyone explain them a bit better?

Any help would be appreciated.

Song · Accepted Answer

3rd:
categoricalFeaturesInfo: A map from feature index to the number of categories of that feature. Any features not in this map are treated as continuous.

For example, Map(0 -> 2, 4 -> 10) specifies that feature 0 is binary (taking values 0 or 1) and that feature 4 has 10 categories (values {0, 1, ..., 9}). Note that feature indices are 0-based: features 0 and 4 are the 1st and 5th elements of an instance’s feature vector.

An empty map, Map[Int, Int](), means that there are no categorical features, so all features are treated as numeric.
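As a rough illustration of how such a map is read, here is a small plain-Scala sketch with no Spark dependency. The feature count numFeatures and the treatment value are made up for the example and are not part of MLlib; the map itself is the one from the example above.

```scala
// Hypothetical setup: each instance has 6 features, with indices 0 to 5.
val numFeatures = 6

// Feature 0 is binary and feature 4 has 10 categories; everything else is continuous.
val categoricalFeaturesInfo: Map[Int, Int] = Map(0 -> 2, 4 -> 10)

// Describe how each feature index would be treated under this map.
val treatment: Seq[String] = (0 until numFeatures).map { i =>
  categoricalFeaturesInfo.get(i) match {
    case Some(arity) =>
      s"feature $i: categorical with $arity categories (values 0 to ${arity - 1})"
    case None =>
      s"feature $i: continuous"
  }
}

treatment.foreach(println)

// An empty map declares no categorical features at all.
val allContinuous: Map[Int, Int] = Map[Int, Int]()
```

Only indices 0 and 4 come out as categorical here; with the empty map, every index would fall into the continuous case.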

5th:
maxDepth: The maximum depth of the tree.

6th:
maxBins: Number of bins used when discretizing continuous features.

Increasing maxBins allows the algorithm to consider more split candidates and make fine-grained split decisions. However, it also increases computation and communication.

Note that maxBins must be at least the maximum number of categories of any categorical feature.
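To tie the positional arguments back to their names, the call from the question can be sketched with each value bound to a named val first. The wrapper trainWithNamedParams and the helper maxArity are hypothetical names introduced only for this illustration, and the code assumes trainData is an RDD[LabeledPoint] with Spark MLlib on the classpath.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

// Hypothetical helper: largest category count declared in the map, or 0 if the map is empty.
def maxArity(categoricalFeaturesInfo: Map[Int, Int]): Int =
  if (categoricalFeaturesInfo.isEmpty) 0 else categoricalFeaturesInfo.values.max

// Hypothetical wrapper (not part of MLlib) that names each positional argument.
def trainWithNamedParams(trainData: RDD[LabeledPoint]): DecisionTreeModel = {
  val numClasses = 7                            // 2nd: number of target classes
  val categoricalFeaturesInfo = Map[Int, Int]() // 3rd: empty, so all features are continuous
  val impurity = "gini"                         // 4th: impurity measure
  val maxDepth = 4                              // 5th: maximum depth of the tree
  val maxBins = 100                             // 6th: bins used to discretize continuous features

  // Early check of the constraint described above, before handing off to Spark.
  require(
    maxBins >= maxArity(categoricalFeaturesInfo),
    "maxBins must be at least the largest number of categories"
  )

  DecisionTree.trainClassifier(
    trainData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
}
```

With the empty map the check passes trivially; it only starts to matter once a categorical feature with many values is declared.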

For more details, see the book "Advanced Analytics with Spark" (chapter 4.8-4.10).

Tags:

scala

apache-spark

apache-spark-mllib

Jin Park