How do I run the Spark decision tree with a categorical feature set using Scala?

I have a feature set with a corresponding categoricalFeaturesInfo: Map[Int,Int]. However, for the life of me I cannot figure out how I am supposed to get the DecisionTree class to work. It will not accept anything but a LabeledPoint as data, and LabeledPoint requires (Double, Vector), where the Vector must hold doubles.

val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))

// Run training algorithm to build the model
val maxDepth: Int = 3
val isMulticlassWithCategoricalFeatures: Boolean = true
val numClassesForClassification: Int = countPossibilities(labelCol) 
val model = DecisionTree.train(LP, Classification, Gini, isMulticlassWithCategoricalFeatures, maxDepth, numClassesForClassification,categoricalFeaturesInfo)

The error I get:

scala> val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
<console>:32: error: overloaded method value dense with alternatives:
  (values: Array[Double])org.apache.spark.mllib.linalg.Vector <and>
  (firstValue: Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
 cannot be applied to (Array[String])
       val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))

My resources thus far: tree config, decision tree, labeledpoint

asked Jul 30 '14 13:07

Climbs_lika_Spyder

2 Answers

You can first transform the categories into numbers, then load the data as if all features were numerical.

When you build a decision tree model in Spark, you just need to tell Spark which features are categorical and what each one's arity is (the number of distinct categories of that feature). You do this by passing a Map[Int, Int] from feature index to arity.

For example, suppose your data looks like this:

1,a,add
2,b,more
1,c,thinking
3,a,to
1,c,me

You can first transform the data into numerical format:

1,0,0
2,1,1
1,2,2
3,0,3
1,2,4
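The transformation above can be sketched in plain Scala, without Spark. This is only an illustration: the names IndexEncoding, buildIndex, and encode are made up for this example, and it assumes indices are assigned in order of first appearance, which is what the table above does.

```scala
object IndexEncoding {
  // Raw rows from the example above: one numeric column, then two string columns.
  val rows: Seq[Array[String]] = Seq(
    Array("1", "a", "add"),
    Array("2", "b", "more"),
    Array("1", "c", "thinking"),
    Array("3", "a", "to"),
    Array("1", "c", "me")
  )

  // Assign each distinct value of a column an index, in order of first appearance.
  def buildIndex(rows: Seq[Array[String]], col: Int): Map[String, Int] =
    rows.map(_(col)).distinct.zipWithIndex.toMap

  // Replace the categorical columns with their indices and convert every value to Double.
  def encode(rows: Seq[Array[String]], categoricalCols: Seq[Int]): Seq[Array[Double]] = {
    val indexes: Map[Int, Map[String, Int]] =
      categoricalCols.map(c => c -> buildIndex(rows, c)).toMap
    rows.map { row =>
      row.zipWithIndex.map { case (value, col) =>
        indexes.get(col) match {
          case Some(index) => index(value).toDouble // categorical column: use its index
          case None        => value.toDouble        // numeric column: parse as-is
        }
      }
    }
  }

  def main(args: Array[String]): Unit =
    encode(rows, Seq(1, 2)).foreach(r => println(r.mkString(",")))
}
```

Each resulting Array[Double] is the kind of value that Vectors.dense accepts, which is what the original Array[String] was missing.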

In that format you can load the data into Spark. Then, to tell Spark that the second and third columns are categorical, create a map:

val categoricalFeaturesInfo = Map[Int, Int]((1, 3), (2, 5))

The map says that the feature with index 1 has arity 3 and the feature with index 2 has arity 5. Both are treated as categorical when you build a decision tree model and pass that map as a parameter of the training function:

val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
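If you would rather derive the map than write it by hand, a rough sketch in plain Scala follows. The names CategoricalInfo and arities are hypothetical. It assumes the data is already index-encoded and takes the arity as the largest index plus one, because Spark expects the values of a categorical feature to be the indices 0 to arity - 1.

```scala
object CategoricalInfo {
  // Index-encoded rows from the example above.
  val encoded: Seq[Array[Double]] = Seq(
    Array(1.0, 0.0, 0.0),
    Array(2.0, 1.0, 1.0),
    Array(1.0, 2.0, 2.0),
    Array(3.0, 0.0, 3.0),
    Array(1.0, 2.0, 4.0)
  )

  // For each categorical column, arity = largest index + 1,
  // so every value in the column falls in 0 until arity.
  def arities(rows: Seq[Array[Double]], categoricalCols: Seq[Int]): Map[Int, Int] =
    categoricalCols.map(c => c -> (rows.map(_(c)).max.toInt + 1)).toMap

  def main(args: Array[String]): Unit =
    println(arities(encoded, Seq(1, 2)))
}
```

Computing the arities over the full dataset, rather than over a training split alone, avoids a category that first shows up at prediction time falling outside the declared range.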

answered Sep 21 '22 06:09

lam

LabeledPoint does not support strings. If your strings are categorical, one way to get them into a LabeledPoint is to split each string column into multiple columns.

So for example, if you have the following dataset:

id,String,Intvalue
1,"a",123
2,"b",456
3,"c",789
4,"a",887

Then you could split your string data, turning each distinct value of the string column into a new column:

a -> 1,0,0
b -> 0,1,0
c -> 0,0,1

Since the String column has 3 distinct values, it becomes 3 new columns, and each original value is represented by a 1 in exactly one of these new columns.

Now your dataset will be:

id,a,b,c,Intvalue
1,1,0,0,123
2,0,1,0,456
3,0,0,1,789
4,1,0,0,887

You can now convert these values to Double and use them in your LabeledPoint.
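As a rough sketch of that splitting step in plain Scala (no Spark needed): the names OneHotEncoding, categories, oneHot, and encode are made up for this illustration, and the column order a, b, c is assumed to come from sorting the distinct values.

```scala
object OneHotEncoding {
  // (id, String, Intvalue) rows from the example dataset above.
  val rows: Seq[(Int, String, Int)] = Seq(
    (1, "a", 123),
    (2, "b", 456),
    (3, "c", 789),
    (4, "a", 887)
  )

  // The sorted distinct values fix the order of the new columns: a, b, c.
  def categories(rows: Seq[(Int, String, Int)]): Seq[String] =
    rows.map(_._2).distinct.sorted

  // One column per category: 1.0 in the matching position, 0.0 elsewhere.
  def oneHot(value: String, cats: Seq[String]): Array[Double] =
    cats.map(c => if (c == value) 1.0 else 0.0).toArray

  // Replace the string column with its one-hot columns; everything becomes a Double.
  def encode(rows: Seq[(Int, String, Int)]): Seq[Array[Double]] = {
    val cats = categories(rows)
    rows.map { case (id, s, n) =>
      Array(id.toDouble) ++ oneHot(s, cats) ++ Array(n.toDouble)
    }
  }

  def main(args: Array[String]): Unit =
    encode(rows).foreach(r => println(r.mkString(",")))
}
```

With this encoding the new columns are ordinary 0/1 numeric features, so they do not need an entry in categoricalFeaturesInfo.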

Another way to convert your strings for a LabeledPoint is to create a list of the distinct values of each column and replace each string with its index in that list. This is not recommended, because for this example dataset the mapping would be

a = 0
b = 1
c = 2

But in this case an algorithm that reads the index as a number will consider a closer to b than to c, an ordering that the data does not support.
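A small plain-Scala sketch may make the concern concrete. It uses Euclidean distance purely as a stand-in for "closeness" (an assumption of this illustration; tree learners split on thresholds rather than distances), and the names EncodingDistance and distance are hypothetical.

```scala
object EncodingDistance {
  // Euclidean distance between two equally sized vectors.
  def distance(x: Array[Double], y: Array[Double]): Double =
    math.sqrt(x.zip(y).map { case (p, q) => (p - q) * (p - q) }.sum)

  // Index encoding: a = 0, b = 1, c = 2.
  val indexA = Array(0.0)
  val indexB = Array(1.0)
  val indexC = Array(2.0)

  // One-hot encoding: a -> 1,0,0; b -> 0,1,0; c -> 0,0,1.
  val hotA = Array(1.0, 0.0, 0.0)
  val hotB = Array(0.0, 1.0, 0.0)
  val hotC = Array(0.0, 0.0, 1.0)

  def main(args: Array[String]): Unit = {
    // With indices read as numbers, a is nearer to b than to c.
    println(distance(indexA, indexB) < distance(indexA, indexC))
    // With one-hot columns, a is equally far from b and from c.
    println(distance(hotA, hotB) == distance(hotA, hotC))
  }
}
```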

answered Sep 19 '22 06:09

dirceusemighini

Tags: tree, scala, apache-spark, categorical-data, apache-spark-mllib
