I am trying to split my data set into train and test data sets. I first read the file into memory as shown here:
val ratings = sc.textFile(movieLensdataHome + "/ratings.csv").map { line =>
  val fields = line.split(",")
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
Then I select 80% of those for my training set:
val train = ratings.sample(false,.8,1)
Is there an easy way to get the test set in a distributed way? I am trying this, but it fails:
val test = ratings.filter(!_.equals(train.map(_)))
val test = ratings.subtract(train)
Take a look here: http://markmail.org/message/qi6srcyka6lcxe7o
Here is the code:
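import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Splits data into two disjoint RDDs: roughly a fraction p of the elements
// goes to the first, the rest to the second.
// ClassTag is used here in place of the deprecated ClassManifest.
def split[T: ClassTag](data: RDD[T], p: Double,
    seed: Long = System.currentTimeMillis): (RDD[T], RDD[T]) = {
  // One fixed seed per partition, so recomputing temp reproduces the same draws.
  val rand = new java.util.Random(seed)
  val partitionSeeds = data.partitions.map(_ => rand.nextLong())
  // Pair every element with a uniform draw in [0, 1).
  val temp = data.mapPartitionsWithIndex { (index, iter) =>
    val partitionRand = new java.util.Random(partitionSeeds(index))
    iter.map(x => (x, partitionRand.nextDouble()))
  }
  (temp.filter(_._2 <= p).map(_._1), temp.filter(_._2 > p).map(_._1))
}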
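Newer Spark releases also provide RDD.randomSplit, which performs this kind of seeded, disjoint split without a hand-written helper. Below is a minimal sketch, assuming spark-core is on the classpath. The object name RandomSplitSketch, the helper trainTestSplit, and the integer stand-in data are hypothetical and only there to make the example self-contained; in the question, sc and ratings already exist.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RandomSplitSketch {
  // Hypothetical helper: a thin wrapper around RDD.randomSplit for a two-way split.
  // The weights are relative; the seed makes the split reproducible.
  def trainTestSplit[T](data: RDD[T], trainFraction: Double, seed: Long): (RDD[T], RDD[T]) = {
    val Array(train, test) =
      data.randomSplit(Array(trainFraction, 1.0 - trainFraction), seed)
    (train, test)
  }

  def main(args: Array[String]): Unit = {
    // Local-mode context purely for illustration.
    val conf = new SparkConf().setAppName("random-split-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the parsed ratings RDD (assumed data, not from the question).
      val data = sc.parallelize(1 to 1000)
      val (train, test) = trainTestSplit(data, 0.8, 1L)
      // The sizes are only approximately 80/20, but they always add up to the input size.
      println(s"train=${train.count()}, test=${test.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

With the ratings RDD from the question, the call would be trainTestSplit(ratings, 0.8, 1L). Compared with sample followed by subtract, this avoids the shuffle that subtract needs, and each element is assigned to exactly one of the two sets.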