
Spark Random Forests: Different results with same seed

When running Spark's RandomForest algorithm, I seem to get different splits in the trees on different runs, even when using the same seed. Could anyone kindly explain whether I am doing something wrong (likely) or the implementation is buggy (which I believe to be unlikely)? Here is the outline of my run:

// Read data into an RDD and convert it to an RDD of LabeledPoint
// (parsing omitted); train_LP_RDD is an RDD[LabeledPoint]
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.regression.LabeledPoint

// Call the random forest trainer with a fixed seed
val seed = 123417
val numTrees = 10
val numClasses = 2
val categoricalFeaturesInfo: Map[Int, Int] = Map()  // all features treated as continuous
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 8
val maxBins = 10
val rfmodel = RandomForest.trainClassifier(train_LP_RDD, numClasses, categoricalFeaturesInfo,
                        numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed)
println(rfmodel.toDebugString)

On two different runs, the output of this snippet is different. For example, a diff on two results shows the following:

sdiff -bBWs run1.debug run2.debug

If (feature 2 <= 15.96)             |         If (feature 2 <= 16.0)
Else (feature 2 > 15.96)            |         Else (feature 2 > 16.0)
If (feature 2 <= 15.96)             |         If (feature 2 <= 16.0)
Else (feature 2 > 15.96)            |         Else (feature 2 > 16.0)
If (feature 2 <= 33.68)             |         If (feature 2 <= 34.66)
Else (feature 2 > 33.68)            |         Else (feature 2 > 34.66)
If (feature 1 <= 17.0)              |         If (feature 1 <= 16.0)
Else (feature 1 > 17.0)             |         Else (feature 1 > 16.0)
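
The same comparison can also be done in-process rather than with sdiff. Below is a sketch reusing the parameters above, with caching added as an assumption: materializing the input first rules out one source of nondeterminism, since both training runs then read the same cached rows instead of re-executing the input pipeline.

train_LP_RDD.cache()
train_LP_RDD.count()  // force materialization before either training run

def train() = RandomForest.trainClassifier(train_LP_RDD, numClasses,
  categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity,
  maxDepth, maxBins, seed)

// With a deterministic, cached input and a fixed seed, this should print true
println(s"identical: ${train().toDebugString == train().toDebugString}")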
Asked May 19 '16 by not_so_knowledgeable


People also ask

Why Random Forest gives different results?

Many trees, each constructed in a certain "random" way, form a Random Forest. Each tree is created from a different sample of rows, and at each node a different sample of features is selected for splitting. Each of the trees makes its own individual prediction.
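
A toy sketch of those two mechanisms (illustrative only, not Spark's actual implementation): each tree sees a bootstrap sample of rows, and each split considers only a random subset of feature indices.

import scala.util.Random

// Bootstrap: sample rows with replacement, one draw per original row
def bootstrapRows[T](rows: Vector[T], rng: Random): Vector[T] =
  Vector.fill(rows.length)(rows(rng.nextInt(rows.length)))

// Per-split feature subset: a random selection of feature indices
def featureSubset(numFeatures: Int, subsetSize: Int, rng: Random): List[Int] =
  rng.shuffle((0 until numFeatures).toList).take(subsetSize)

val rng = new Random(123417)
println(bootstrapRows(Vector("a", "b", "c", "d"), rng))  // rows drawn with replacement
println(featureSubset(30, 5, rng))                       // e.g. ~sqrt(30) features per split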

How does number of trees affect random forest?

If the number of predictors is large but the number of trees is too small, then some features can (theoretically) be missed in all subspaces used. Both cases result in a decrease in the random forest's predictive power.
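
To see that effect empirically, a sweep over numTrees is straightforward. A sketch, reusing the question's parameters plus a hypothetical hold-out set test_LP_RDD (not in the original post):

for (n <- Seq(1, 5, 10, 50)) {
  val model = RandomForest.trainClassifier(train_LP_RDD, numClasses,
    categoricalFeaturesInfo, n, featureSubsetStrategy, impurity,
    maxDepth, maxBins, seed)
  // Fraction of misclassified hold-out points
  val testError = test_LP_RDD.map { p =>
    if (model.predict(p.features) != p.label) 1.0 else 0.0
  }.mean()
  println(s"numTrees=$n testError=$testError")
}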

Do random forests generalize better?

Random Forest is amongst the best performing Machine Learning algorithms, which has seen wide adoption. While it is a bit harder to interpret than a single Decision Tree model, it brings many advantages, such as improved performance and better generalization.

Why random forest gives high accuracy?

Random forest improves on bagging because it decorrelates the trees with the introduction of splitting on a random subset of features. This means that at each split of the tree, the model considers only a small subset of features rather than all of the features of the model.
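
In Spark's MLlib this feature subsampling is controlled by the featureSubsetStrategy parameter: with "auto" and more than one tree, classification uses the square root of the number of features, while "all" turns the decorrelation off by considering every feature at every split. A sketch reusing the question's other parameters:

val decorrelated = RandomForest.trainClassifier(train_LP_RDD, numClasses,
  categoricalFeaturesInfo, numTrees, "sqrt", impurity, maxDepth, maxBins, seed)
val baggingOnly = RandomForest.trainClassifier(train_LP_RDD, numClasses,
  categoricalFeaturesInfo, numTrees, "all", impurity, maxDepth, maxBins, seed)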


1 Answer

Can't tell without more context (and not enough rep to comment), but as Shaido suggested, one cause could be that train_LP_RDD is not deterministic. E.g. if you're doing something like

val train_LP_RDD = sc.textFile(path).sample(withReplacement = false, fraction = 0.5)

Then you're going to get a different sample each time you run trainClassifier, because Spark recomputes the RDD lineage (including the unseeded sample) on each action, even if you're not redefining train_LP_RDD.
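
If that is the cause, one fix is to make the sample deterministic and materialize it. A sketch (parseLabeledPoint is a hypothetical parser standing in for the string-to-LabeledPoint conversion):

val train_LP_RDD = sc.textFile(path)
  .sample(withReplacement = false, fraction = 0.5, seed = 42L)
  .map(parseLabeledPoint)  // hypothetical parser, not from the original post
  .cache()
train_LP_RDD.count()  // materialize once so later jobs reuse the cached rows

With an explicit seed, sample produces the same rows even when the lineage is recomputed, and cache() additionally avoids re-running the upstream stages on every action.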

Answered Oct 13 '22 by Eric Doi