It is known that GBTs in Spark give you predicted labels as of now.
I was thinking of trying to calculate the predicted probability for a class (say, all the instances falling under a certain leaf).
The code to build the GBTs:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils
//Importing the data
val data = sc.textFile("data/mllib/credit_approval_2_attr.csv") //using the credit approval data set from UCI machine learning repository
//Parsing the data
val parsedData = data.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}
//Splitting the data
val splits = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
// Train a GradientBoostedTrees model.
// The defaultParams for Classification use LogLoss by default.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 2 // We can use more iterations in practice.
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 2
boostingStrategy.treeStrategy.maxBins = 32
boostingStrategy.treeStrategy.subsamplingRate = 0.5
boostingStrategy.treeStrategy.maxMemoryInMB = 1024
boostingStrategy.learningRate = 0.1
// Empty categoricalFeaturesInfo indicates all features are continuous.
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()
val model = GradientBoostedTrees.train(training, boostingStrategy)
model.toDebugString
For simplicity, this gives me 2 trees of depth 2, as below:
Tree 0:
  If (feature 3 <= 2.0)
    If (feature 2 <= 1.25)
      Predict: -0.5752212389380531
    Else (feature 2 > 1.25)
      Predict: 0.07462686567164178
  Else (feature 3 > 2.0)
    If (feature 0 <= 30.17)
      Predict: 0.7272727272727273
    Else (feature 0 > 30.17)
      Predict: 1.0
Tree 1:
  If (feature 5 <= 67.0)
    If (feature 4 <= 100.0)
      Predict: 0.5739387416147804
    Else (feature 4 > 100.0)
      Predict: -0.550117566730937
  Else (feature 5 > 67.0)
    If (feature 2 <= 0.0)
      Predict: 3.0383669122382835
    Else (feature 2 > 0.0)
      Predict: 0.4332824083446489
My question is: can I use the above trees to calculate predicted probabilities? That is, for every instance in the feature set used for prediction:

exp(leaf score from tree 0 + leaf score from tree 1) / (1 + exp(leaf score from tree 0 + leaf score from tree 1))

This gives me a kind of probability, but I am not sure if it is the right way to do it. Also, if there is any document explaining how the leaf scores (predictions) are calculated, I would be really grateful if anybody could share it.
Any suggestion would be superb.
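For concreteness, here is that calculation as a minimal Scala sketch against the model trained above (it deliberately sums the raw leaf scores without model.treeWeights, which is exactly the part I am unsure about):

// Proposed: sigmoid of the unweighted sum of the leaf scores from all trees
val proposedProb = test.map { point =>
  val rawSum = model.trees.map(_.predict(point.features)).sum
  1.0 / (1.0 + math.exp(-rawSum)) // same as exp(s) / (1 + exp(s))
}
proposedProb.take(5)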
Here is my approach, using Spark's internal dependencies. You will need to import the linear algebra library for the matrix operation later, i.e., multiplying the tree predictions by the learning rate.
import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.linalg.distributed.{RowMatrix}
Say you build a model with GBT:
val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
To calculate the probability using the model object:
// Get the raw score from each tree (their weighted sum is on the log-odds scale)
val treePredictions = testData.map { point => model.trees.map(_.predict(point.features)) }
// Transform the arrays into matrices for multiplication
val treePredictionsVector = treePredictions.map(array => Vectors.dense(array))
val treePredictionsMatrix = new RowMatrix(treePredictionsVector)
val learningRate = model.treeWeights // per-tree ensemble weights (the learning rate is folded into these)
val learningRateMatrix = Matrices.dense(learningRate.size, 1, learningRate)
val weightedTreePredictions = treePredictionsMatrix.multiply(learningRateMatrix)
// Calculate probability by ensembling the log odds
val classProb = weightedTreePredictions.rows.flatMap(_.toArray).map(x => 1 / (1 + Math.exp(-1 * x)))
classProb.collect
// You may tweak your decision boundary for different class labels
val classLabel = classProb.map(x => if (x > 0.5) 1.0 else 0.0)
classLabel.collect
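As a rough worked example, assuming MLlib's convention that treeWeights is [1.0, learningRate, learningRate, ...] (worth verifying against your Spark version): an instance landing in tree 0's leaf with score 0.7273 and tree 1's leaf with score 0.5739, with learningRate = 0.1 as in the question, gets a weighted sum of 1.0 * 0.7273 + 0.1 * 0.5739 ≈ 0.7847, and hence a probability of 1 / (1 + exp(-0.7847)) ≈ 0.687.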
Here is a code snippet you can copy & paste directly into spark-shell:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.linalg.distributed.{RowMatrix}
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
// Load and parse the data file.
val csvData = sc.textFile("data/mllib/sample_tree_data.csv")
val data = csvData.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
// Train a GBT model.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 50
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 6
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()
val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
// Get class label from raw predict function
val predictedLabels = model.predict(testData.map(_.features))
predictedLabels.collect
// Get class probability
val treePredictions = testData.map { point => model.trees.map(_.predict(point.features)) }
val treePredictionsVector = treePredictions.map(array => Vectors.dense(array))
val treePredictionsMatrix = new RowMatrix(treePredictionsVector)
val learningRate = model.treeWeights // per-tree ensemble weights (the learning rate is folded into these)
val learningRateMatrix = Matrices.dense(learningRate.size, 1, learningRate)
val weightedTreePredictions = treePredictionsMatrix.multiply(learningRateMatrix)
val classProb = weightedTreePredictions.rows.flatMap(_.toArray).map(x => 1 / (1 + Math.exp(-1 * x)))
val classLabel = classProb.map(x => if (x > 0.5) 1.0 else 0.0)
classLabel.collect
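If you prefer a plain per-point function over the distributed matrix route, you can compute the same weighted sum directly. This is a sketch that mirrors what GradientBoostedTreesModel does internally for its raw prediction; blas here is the netlib-java BLAS instance that Spark MLlib itself uses: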
import org.apache.spark.mllib.linalg.Vector
import com.github.fommil.netlib.BLAS.{getInstance => blas}

def score(features: Vector, gbdt: GradientBoostedTreesModel): Double = {
  // Dot product of the per-tree raw scores with the per-tree weights
  val treePredictions = gbdt.trees.map(_.predict(features))
  blas.ddot(gbdt.numTrees, treePredictions, 1, gbdt.treeWeights, 1)
}

def sigmoid(v: Double): Double = {
  1.0 / (1.0 + math.exp(-v))
}
// model is the output of GradientBoostedTrees.train(..., ...)
// testData is an RDD[LabeledPoint] (e.g., loaded from a LIBSVM file)
val labelAndPreds = testData.map { point =>
  val prediction = sigmoid(score(point.features, model))
  (point.label, Vectors.dense(1.0 - prediction, prediction))
}
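As a sanity check, the labels implied by the 0.5 probability threshold should match the model's built-in predict, since (if my reading of MLlib is right) classification predict thresholds the same weighted sum at 0, and sigmoid(x) > 0.5 exactly when x > 0:

val thresholdLabels = labelAndPreds.map { case (_, probs) => if (probs(1) > 0.5) 1.0 else 0.0 }
val builtinLabels = model.predict(testData.map(_.features))
thresholdLabels.zip(builtinLabels).filter { case (a, b) => a != b }.count // expect 0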