I'm studying Spark MLlib. While reading about DecisionTree, I came across the following DecisionTree.trainClassifier usage example:
import org.apache.spark.mllib.tree._
val model = DecisionTree.trainClassifier(trainData, 7, Map[Int, Int](), "gini", 4, 100)
There are six parameters here, and I don't understand the 3rd (the Map), the 5th (4), or the 6th (100).
Google suggests they are the categorical features, lambda, and alpha. Can anyone explain them a bit better?
I'd appreciate your help.
3rd:
categoricalFeaturesInfo: a map from feature index to the number of categories that feature can take. Any feature not in this map is treated as continuous.
For example, Map(0 -> 2, 4 -> 10) specifies that feature 0 is binary (taking values 0 or 1) and that feature 4 has 10 categories (values {0, 1, ..., 9}). Note that feature indices are 0-based: features 0 and 4 are the 1st and 5th elements of an instance’s feature vector.
Passing an empty Map[Int, Int]() means that every feature is treated as continuous (numerical).
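In code, that docs example looks like this (a minimal sketch; the indices and category counts are just the ones from that example):

// Keys are 0-based feature indices, values are the number of categories.
val categoricalFeaturesInfo = Map(0 -> 2, 4 -> 10) // feature 0 is binary, feature 4 has 10 categories

// An empty map tells the algorithm to treat every feature as continuous.
val allContinuous = Map[Int, Int]()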
5th:
maxDepth: the maximum depth of the tree. Deeper trees can capture more complex interactions, but they take longer to train and are more prone to overfitting.
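To get a sense of scale, here is a quick sketch; the arithmetic is just the standard bound for binary trees, nothing specific to this call:

// A binary tree of depth d has at most 2^(d+1) - 1 nodes, so maxDepth = 4
// caps the tree at 31 nodes. After training you can inspect the actual size
// via model.depth and model.numNodes on the returned DecisionTreeModel.
val maxDepth = 4
val maxNodes = (1 << (maxDepth + 1)) - 1
println(s"A tree of depth $maxDepth has at most $maxNodes nodes") // prints 31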
6th:
maxBins: Number of bins used when discretizing continuous features.
Increasing maxBins allows the algorithm to consider more split candidates and make fine-grained split decisions. However, it also increases computation and communication.
Note that maxBins must be at least the maximum number of categories of any categorical feature.
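Putting the pieces together, here is a minimal, self-contained sketch in local mode with a tiny made-up dataset (the data, object name, and parameter values are invented for illustration, not taken from the question):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

object DecisionTreeParamsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dt-demo").setMaster("local[*]"))

    // Two features per point: feature 0 is continuous, feature 1 is categorical with 3 categories.
    val trainData = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(1.2, 0.0)),
      LabeledPoint(0.0, Vectors.dense(0.9, 0.0)),
      LabeledPoint(0.0, Vectors.dense(1.5, 2.0)),
      LabeledPoint(1.0, Vectors.dense(3.4, 1.0)),
      LabeledPoint(1.0, Vectors.dense(2.8, 2.0)),
      LabeledPoint(1.0, Vectors.dense(3.1, 1.0))
    ))

    val numClasses = 2                        // 2nd argument: labels are 0 or 1
    val categoricalFeaturesInfo = Map(1 -> 3) // 3rd argument: feature 1 has 3 categories
    val impurity = "gini"                     // 4th argument: impurity measure
    val maxDepth = 4                          // 5th argument: depth limit
    val maxBins = 100                         // 6th argument: must be >= 3, the largest category count

    val model = DecisionTree.trainClassifier(
      trainData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)

    println(s"depth = ${model.depth}, nodes = ${model.numNodes}")
    println(model.toDebugString)

    sc.stop()
  }
}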
For more details, you can refer to the book "Advanced Analytics with Spark" (sections 4.8-4.10).