
How do I run the Spark decision tree with a categorical feature set using Scala?

I have a feature set with a corresponding categoricalFeaturesInfo: Map[Int,Int]. However, for the life of me I cannot figure out how I am supposed to get the DecisionTree class to work. It will not accept anything but a LabeledPoint as data, and LabeledPoint requires (Double, Vector), where the Vector's elements must be Doubles.

val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))

// Run training algorithm to build the model
val maxDepth: Int = 3
val isMulticlassWithCategoricalFeatures: Boolean = true
val numClassesForClassification: Int = countPossibilities(labelCol) 
val model = DecisionTree.train(LP, Classification, Gini, isMulticlassWithCategoricalFeatures, maxDepth, numClassesForClassification,categoricalFeaturesInfo)

The error I get:

scala> val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
<console>:32: error: overloaded method value dense with alternatives:
  (values: Array[Double])org.apache.spark.mllib.linalg.Vector <and>
  (firstValue: Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
 cannot be applied to (Array[String])
       val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))

My resources thus far: tree config, decision tree, labeledpoint

Asked by Climbs_lika_Spyder on Jul 30 '14

People also ask

How do you handle categorical features in a decision tree?

If the feature is categorical, the split is done with the elements belonging to a particular class. If the feature is continuous, the split is done with the elements higher than a threshold. At every split, the decision tree will take the best variable at that moment.

Can decision trees have categorical variables?

A categorical variable decision tree includes categorical target variables that are divided into categories. For example, the categories can be yes or no. The categories mean that every stage of the decision process falls into one category, and there are no in-betweens.

What is maxBins in decision tree?

It is a decision tree parameter: maxBins is the number of bins used when discretizing continuous features. Increasing maxBins allows the algorithm to consider more split candidates and make fine-grained split decisions. However, it also increases computation and communication.
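To illustrate the idea, here is a hypothetical sketch (not MLlib's actual implementation) of equal-frequency binning of one continuous feature into maxBins bins:

```scala
// Hypothetical sketch: discretize a continuous feature into at most maxBins
// bins using equal-frequency (quantile) thresholds. MLlib's internal binning
// is more sophisticated; this only illustrates the concept.
def binThresholds(values: Seq[Double], maxBins: Int): Seq[Double] = {
  val sorted = values.sorted
  // One threshold between each pair of adjacent bins.
  (1 until maxBins).map(i => sorted((i * sorted.length) / maxBins)).distinct
}

// A value's bin index is the number of thresholds at or below it.
def binIndex(value: Double, thresholds: Seq[Double]): Int =
  thresholds.count(_ <= value)

val feature = Seq(0.5, 1.2, 3.3, 2.1, 4.8, 0.9, 2.7, 3.9)
val thresholds = binThresholds(feature, maxBins = 4)
```

With more bins, the candidate thresholds track the data more closely, which is exactly the trade-off described above.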


2 Answers

You can first transform categories to numbers, then load data as if all features are numerical.

When you build a decision tree model in Spark, you just need to tell Spark which features are categorical, along with each feature's arity (the number of distinct categories of that feature), by specifying a Map[Int, Int] from feature index to arity.

For example if you have data as:

1,a,add
2,b,more
1,c,thinking
3,a,to
1,c,me

You can first transform data into numerical format as:

1,0,0
2,1,1
1,2,2
3,0,3
1,2,4
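The transformation above can be sketched in plain Scala (a minimal sketch; in practice you would do this with Spark transformations on an RDD or DataFrame):

```scala
// Map each distinct string in a categorical column to an integer index,
// so the rows can later be parsed as all-numeric features.
val rows = Seq(
  Seq("1", "a", "add"),
  Seq("2", "b", "more"),
  Seq("1", "c", "thinking"),
  Seq("3", "a", "to"),
  Seq("1", "c", "me")
)

// Build one value -> index map per categorical column (here columns 1 and 2).
// Seq.distinct preserves first-occurrence order, so "a" -> 0, "b" -> 1, ...
def indexMap(rows: Seq[Seq[String]], col: Int): Map[String, Int] =
  rows.map(_(col)).distinct.zipWithIndex.toMap

val col1 = indexMap(rows, 1) // Map(a -> 0, b -> 1, c -> 2)
val col2 = indexMap(rows, 2) // add -> 0, more -> 1, thinking -> 2, to -> 3, me -> 4

val numeric = rows.map(r => Seq(r(0), col1(r(1)).toString, col2(r(2)).toString))
// First row becomes Seq("1", "0", "0"), matching the table above.
```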

In that format you can load data to Spark. Then if you want to tell Spark the second and the third columns are categorical, you should create a map:

val categoricalFeaturesInfo = Map[Int, Int]((1, 3), (2, 5))

The map says that the feature with index 1 has arity 3, and the feature with index 2 has arity 5. Those features will be treated as categorical when we build a decision tree model, passing that map as a parameter of the training function:

val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
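Since the declared arity must match the data, one way to avoid mistakes is to derive the map from the encoded data itself. This is a hypothetical helper, not part of MLlib:

```scala
// Hypothetical helper: derive categoricalFeaturesInfo from already
// numerically encoded data, so the declared arity always matches the
// number of distinct values actually present in each categorical column.
val data = Seq(
  Seq(1.0, 0.0, 0.0),
  Seq(2.0, 1.0, 1.0),
  Seq(1.0, 2.0, 2.0),
  Seq(3.0, 0.0, 3.0),
  Seq(1.0, 2.0, 4.0)
)

def arityOf(data: Seq[Seq[Double]], cols: Seq[Int]): Map[Int, Int] =
  cols.map(c => c -> data.map(_(c)).distinct.size).toMap

val categoricalFeaturesInfo = arityOf(data, Seq(1, 2))
// For the example data this yields Map(1 -> 3, 2 -> 5), as above.
```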
Answered by lam on Sep 21 '22


Strings are not supported by LabeledPoint. One way to get string data into a LabeledPoint is to split it into multiple columns, treating the strings as categorical.

So for example, if you have the following dataset:

id,String,Intvalue
1,"a",123
2,"b",456
3,"c",789
4,"a",887

Then you could split your string data, turning each distinct string value into a new column:

a -> 1,0,0
b -> 0,1,0
c -> 0,0,1

As you have 3 distinct string values, you will convert your string column into 3 new columns, and each value will be represented by a value in these new columns.

Now your dataset will be

id,a,b,c,Intvalue
1,1,0,0,123
2,0,1,0,456
3,0,0,1,789
4,1,0,0,887

You can now convert these into Double values and use them in your LabeledPoint.
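The one-hot expansion described above can be sketched like this (a minimal sketch in plain Scala; for real pipelines Spark's feature transformers would do this at scale):

```scala
// One-hot encode a string column: each distinct string becomes its own
// 0/1 column, so no artificial ordering is introduced between categories.
val values = Seq("a", "b", "c", "a")
val categories = values.distinct // Seq("a", "b", "c"), in first-seen order

def oneHot(v: String, categories: Seq[String]): Seq[Double] =
  categories.map(c => if (c == v) 1.0 else 0.0)

val encoded = values.map(oneHot(_, categories))
// "a" -> Seq(1.0, 0.0, 0.0), "b" -> Seq(0.0, 1.0, 0.0), "c" -> Seq(0.0, 0.0, 1.0)
```

The encoded rows can then be concatenated with the numeric columns and fed to Vectors.dense.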

Another way to convert your strings into a LabeledPoint is to create a distinct list of values for each column, and replace each string with its index in that list. This is generally not recommended, because in this example dataset it would be

a = 0
b = 1
c = 2

But then the algorithm would consider a closer to b than to c, an ordering that does not actually exist in the data.

Answered by dirceusemighini on Sep 19 '22