I have a feature set with a corresponding categoricalFeaturesInfo: Map[Int,Int]. However, for the life of me I cannot figure out how I am supposed to get the DecisionTree class to work. It will not accept anything, but a LabeledPoint as data. However, LabeledPoint requires (double, vector) where the vector requires doubles.
val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
// Run training algorithm to build the model
val maxDepth: Int = 3
val isMulticlassWithCategoricalFeatures: Boolean = true
val numClassesForClassification: Int = countPossibilities(labelCol)
val model = DecisionTree.train(LP, Classification, Gini, isMulticlassWithCategoricalFeatures, maxDepth, numClassesForClassification,categoricalFeaturesInfo)
The error I get:
scala> val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
<console>:32: error: overloaded method value dense with alternatives:
(values: Array[Double])org.apache.spark.mllib.linalg.Vector <and>
(firstValue: Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
cannot be applied to (Array[String])
val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
My resources thus far: tree config, decision tree, labeledpoint
If the feature is categorical, the split is done with the elements belonging to a particular class. If the feature is contiuous, the split is done with the elements higher than a threshold. At every split, the decision tree will take the best variable at that moment.
Categorical variable decision treeA categorical variable decision tree includes categorical target variables that are divided into categories. For example, the categories can be yes or no. The categories mean that every stage of the decision process falls into one category, and there are no in-betweens.
It is a decision tree parameter: maxBins : Number of bins used when discretizing continuous features. Increasing maxBins allows the algorithm to consider more split candidates and make fine-grained split decisions. However, it also increases computation and communication.
You can first transform categories to numbers, then load data as if all features are numerical.
When you build a decision tree model in Spark, you just need to tell spark which features are categorical and also the feature's arity (the number of distinct categories of that feature) by specifying a map Map[Int, Int]()
from feature indices to its arity.
For example if you have data as:
1,a,add
2,b,more
1,c,thinking
3,a,to
1,c,me
You can first transform data into numerical format as:
1,0,0
2,1,1
1,2,2
3,0,3
1,2,4
In that format you can load data to Spark. Then if you want to tell Spark the second and the third columns are categorical, you should create a map:
categoricalFeaturesInfo = Map[Int, Int]((1,3),(2,5))
The map tells us that feature with index 1 has arity 3, and feature with index 2 has artity 5. They will be considered as categorical when we build a decision tree model passing that map as a parameter of the training function:
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
Strings are not supported by LabeledPoint, one way to put it into a LabeledPoint is to split your data into multiple columns, considering that your strings are categorical.
So for example, if you have the following dataset:
id,String,Intvalue
1,"a",123
2,"b",456
3,"c",789
4,"a",887
Then you could split your string data, making each value of the strings into a new column
a -> 1,0,0
b -> 0,1,0
c -> 0,0,1
As you have 3 distinct values of Strings, you will convert your string column to 3 new columns, and each value will be represented by a value in this new columns.
Now your dataset will be
id,String,Intvalue
1,1,0,0,123
2,0,1,0,456
3,0,0,1,789
4,1,0,0,887
Which now you can convert into Double values and use it into your LabeledPoint.
Another way to convert your strings into a LabeledPoint is to create a distinctlist of values for each column, and convert the values of the strings into the index of that string in this list. Which is not recommended because if so, in this supposed dataset it will be
a = 0
b = 1
c = 2
But in this case the algorithms will consider a closer to b than to c, which cannot be determined.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With