
spark scala get uncommon map elements

I am trying to split my data set into train and test data sets. I first read the file into memory as shown here:

val ratings = sc.textFile(movieLensdataHome+"/ratings.csv").map { line=>
  val fields = line.split(",")
  Rating(fields(0).toInt,fields(1).toInt,fields(2).toDouble)
}

Then I select 80% of those for my training set:

val train = ratings.sample(false,.8,1)

Is there an easy way to get the test set (the remaining 20%) in a distributed way? I tried the following, but it fails:

val test = ratings.filter(!_.equals(train.map(_)))
venuktan asked May 18 '26 12:05


2 Answers

val test = ratings.subtract(train)
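Note that subtract compares records by value and needs a shuffle, so if ratings contains exact duplicate records, every copy of a record that ended up in train is dropped from test. If that matters, one alternative is RDD.randomSplit, which produces both sets in a single call. The following is a minimal sketch, not a drop-in solution: the Rating case class is a hypothetical stand-in for the one in the question, and the generated data, the local[2] master, and the names RandomSplitSketch and trainTest are assumptions made only so the example runs on its own.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for the question's Rating type (user, product, rating).
case class Rating(user: Int, product: Int, rating: Double)

object RandomSplitSketch {
  // Returns (train, test) holding roughly 80% and 20% of the records.
  // The weights are probabilities per record, so the sizes are approximate.
  def trainTest(ratings: RDD[Rating], seed: Long): (RDD[Rating], RDD[Rating]) = {
    val Array(train, test) = ratings.randomSplit(Array(0.8, 0.2), seed)
    (train, test)
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption for the sketch; use your cluster settings instead.
    val conf = new SparkConf().setAppName("random-split-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Synthetic, all-distinct ratings in place of ratings.csv.
      val ratings = sc.parallelize((1 to 1000).map(i => Rating(i, i % 50, (i % 5) + 1.0)))
      val (train, test) = trainTest(ratings, seed = 1L)

      // Every record lands in exactly one of the two sets.
      assert(train.count() + test.count() == ratings.count())
      assert(train.intersection(test).isEmpty())
      println(s"train=${train.count()}, test=${test.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

With the RDD from the question, the call would be along the lines of val Array(train, test) = ratings.randomSplit(Array(0.8, 0.2), 1L).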
Shyamendra Solanki answered May 21 '26 05:05


Take a look here. http://markmail.org/message/qi6srcyka6lcxe7o

Here is the code:

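  import org.apache.spark.rdd.RDD
  import scala.reflect.ClassTag

  // Splits data into two disjoint RDDs; each element goes to the first with probability p.
  // ClassTag replaces the deprecated ClassManifest.
  def split[T: ClassTag](data: RDD[T], p: Double,
      seed: Long = System.currentTimeMillis): (RDD[T], RDD[T]) = {
    val rand = new java.util.Random(seed)
    // One seed per partition, so each partition draws its own reproducible sequence.
    val partitionSeeds = data.partitions.map(_ => rand.nextLong())
    val temp = data.mapPartitionsWithIndex { (index, iter) =>
      val partitionRand = new java.util.Random(partitionSeeds(index))
      // Tag every element with a uniform draw in [0, 1).
      iter.map(x => (x, partitionRand.nextDouble()))
    }
    // Both filters see the same tags as long as data is recomputed in the same order.
    (temp.filter(_._2 <= p).map(_._1), temp.filter(_._2 > p).map(_._1))
  }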
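
The function above filters temp twice, and Spark may recompute temp for each filter unless it is cached. The two halves still complement each other because a java.util.Random with a fixed seed replays the same sequence of draws. The following plain-Scala sketch (no Spark) illustrates that property for a single "partition"; the names SeededSplitSketch and tag and the seed 42 are made up for the illustration.

```scala
import java.util.Random

object SeededSplitSketch {
  // Tags each element with a draw from a Random seeded with `seed`,
  // mimicking what one partition does inside mapPartitionsWithIndex.
  def tag[T](items: Seq[T], seed: Long): Seq[(T, Double)] = {
    val rand = new Random(seed)
    items.map(x => (x, rand.nextDouble()))
  }

  def main(args: Array[String]): Unit = {
    val items = (1 to 100).toSeq
    val p = 0.8

    // Two independent passes over the same data with the same seed,
    // as happens when each filter triggers its own recomputation.
    val first = tag(items, 42L).filter(_._2 <= p).map(_._1)
    val second = tag(items, 42L).filter(_._2 > p).map(_._1)

    // The halves are disjoint and together cover every element.
    assert(first.intersect(second).isEmpty)
    assert((first ++ second).sorted == items)
    println(s"${first.size} in the first half, ${second.size} in the second")
  }
}
```

The guarantee only holds if the elements reach the generator in the same order on every pass; if the input RDD is not deterministic (for example after an unordered shuffle), caching temp before filtering is the safer choice.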
Oussama answered May 21 '26 05:05