
Spark Random Forests: Different results with same seed

When running Spark's RandomForest algorithm, I seem to get different splits in the trees on different runs, even when using the same seed. Could anyone kindly explain whether I am doing something wrong (likely) or the implementation is buggy (which I believe to be unlikely)? Here is the outline of my run:

// Read data into an RDD and convert it to an RDD of LabeledPoint
// (parsing omitted); train_LP_RDD is an RDD[LabeledPoint]
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.regression.LabeledPoint

// Call the random forest trainer with a fixed seed
val seed = 123417
val numTrees = 10
val numClasses = 2
val categoricalFeaturesInfo: Map[Int, Int] = Map()  // all features treated as continuous
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 8
val maxBins = 10
val rfmodel = RandomForest.trainClassifier(train_LP_RDD, numClasses, categoricalFeaturesInfo,
                        numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed)
println(rfmodel.toDebugString)

On two different runs, the output of this snippet is different. For example, a diff on two results shows the following:

sdiff -bBWs run1.debug run2.debug

If (feature 2 <= 15.96)             |         If (feature 2 <= 16.0)
Else (feature 2 > 15.96)            |         Else (feature 2 > 16.0)
If (feature 2 <= 15.96)             |         If (feature 2 <= 16.0)
Else (feature 2 > 15.96)            |         Else (feature 2 > 16.0)
If (feature 2 <= 33.68)             |         If (feature 2 <= 34.66)
Else (feature 2 > 33.68)            |         Else (feature 2 > 34.66)
If (feature 1 <= 17.0)              |         If (feature 1 <= 16.0)
Else (feature 1 > 17.0)             |         Else (feature 1 > 16.0)
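
The same comparison can also be done in-process rather than with sdiff. Below is a sketch reusing the parameters above, with caching added as an assumption: materializing the input first rules out one source of nondeterminism, since both training runs then read the same cached rows instead of re-executing the input pipeline.

train_LP_RDD.cache()
train_LP_RDD.count()  // force materialization before either training run

def train() = RandomForest.trainClassifier(train_LP_RDD, numClasses,
  categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity,
  maxDepth, maxBins, seed)

// With a deterministic, cached input and a fixed seed, this should print true
println(s"identical: ${train().toDebugString == train().toDebugString}")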
Asked May 19 '16 by not_so_knowledgeable


People also ask

Why Random Forest gives different results?

Many trees, each constructed in a certain "random" way, form a Random Forest. Each tree is created from a different sample of rows, and at each node a different sample of features is selected for splitting. Each of the trees makes its own individual prediction.
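
A toy sketch of those two mechanisms (illustrative only, not Spark's actual implementation): each tree sees a bootstrap sample of rows, and each split considers only a random subset of feature indices.

import scala.util.Random

// Bootstrap: sample rows with replacement, one draw per original row
def bootstrapRows[T](rows: Vector[T], rng: Random): Vector[T] =
  Vector.fill(rows.length)(rows(rng.nextInt(rows.length)))

// Per-split feature subset: a random selection of feature indices
def featureSubset(numFeatures: Int, subsetSize: Int, rng: Random): List[Int] =
  rng.shuffle((0 until numFeatures).toList).take(subsetSize)

val rng = new Random(123417)
println(bootstrapRows(Vector("a", "b", "c", "d"), rng))  // rows drawn with replacement
println(featureSubset(30, 5, rng))                       // e.g. ~sqrt(30) features per split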

How does number of trees affect random forest?

If the number of predictors is large but the number of trees is too small, then some features can (theoretically) be missed in all subspaces used. Both cases result in a decrease in the random forest's predictive power.
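
To see that effect empirically, a sweep over numTrees is straightforward. A sketch, reusing the question's parameters plus a hypothetical hold-out set test_LP_RDD (not in the original post):

for (n <- Seq(1, 5, 10, 50)) {
  val model = RandomForest.trainClassifier(train_LP_RDD, numClasses,
    categoricalFeaturesInfo, n, featureSubsetStrategy, impurity,
    maxDepth, maxBins, seed)
  // Fraction of misclassified hold-out points
  val testError = test_LP_RDD.map { p =>
    if (model.predict(p.features) != p.label) 1.0 else 0.0
  }.mean()
  println(s"numTrees=$n testError=$testError")
}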

Do random forests generalize better?

Random Forest is amongst the best performing Machine Learning algorithms, which has seen wide adoption. While it is a bit harder to interpret than a single Decision Tree model, it brings many advantages, such as improved performance and better generalization.

Why random forest gives high accuracy?

Random forest improves on bagging because it decorrelates the trees with the introduction of splitting on a random subset of features. This means that at each split of the tree, the model considers only a small subset of features rather than all of the features of the model.
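
In Spark's MLlib this feature subsampling is controlled by the featureSubsetStrategy parameter: with "auto" and more than one tree, classification uses the square root of the number of features, while "all" turns the decorrelation off by considering every feature at every split. A sketch reusing the question's other parameters:

val decorrelated = RandomForest.trainClassifier(train_LP_RDD, numClasses,
  categoricalFeaturesInfo, numTrees, "sqrt", impurity, maxDepth, maxBins, seed)
val baggingOnly = RandomForest.trainClassifier(train_LP_RDD, numClasses,
  categoricalFeaturesInfo, numTrees, "all", impurity, maxDepth, maxBins, seed)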


1 Answer

Can't tell without more context (and not enough rep to comment), but as Shaido suggested, one cause could be that train_LP_RDD is not deterministic. E.g. if you're doing something like

val train_LP_RDD = sc.textFile(path).sample(withReplacement = false, fraction = 0.5)

Then you're going to get a different sample each time you run trainClassifier, because Spark recomputes the RDD lineage (including the unseeded sample) on each action, even if you're not redefining train_LP_RDD.
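
If that is the cause, one fix is to make the sample deterministic and materialize it. A sketch (parseLabeledPoint is a hypothetical parser standing in for the string-to-LabeledPoint conversion):

val train_LP_RDD = sc.textFile(path)
  .sample(withReplacement = false, fraction = 0.5, seed = 42L)
  .map(parseLabeledPoint)  // hypothetical parser, not from the original post
  .cache()
train_LP_RDD.count()  // materialize once so later jobs reuse the cached rows

With an explicit seed, sample produces the same rows even when the lineage is recomputed, and cache() additionally avoids re-running the upstream stages on every action.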

Answered Oct 13 '22 by Eric Doi