spark-scala: Filter RDD if the record of the RDD doesn't exist in another RDD

Question

I have an RDD that has the structure as follows:

((user_id,item_id,rating))

Let's call this RDD training.

Then there is another rdd with the same structure:

((user_id,item_id,rating))

Let's call this RDD test.

I want to make sure that, on a per-user basis, data that is in test doesn't appear in train. For example:

train = {(u1, item2), (u1, item4), (u1, item3)}
test = {(u1, item2), (u1, item5)}

I want item2 to be removed from u1's training data.

So I started by grouping both RDDs by (user_id, item_id):

 val groupedTrainData = trainData.groupBy(x => (x._1, x._2))

But I feel like this is not the way to go.

Daniel Darabos · Accepted Answer

You need PairRDDFunctions.subtractByKey:

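import org.apache.spark.rdd.RDD

// Both RDDs are keyed by (user, item). Int ids and Double ratings are assumed here.
def cleanTrain(
  train: RDD[((Int, Int), Double)],
  test: RDD[((Int, Int), Double)]): RDD[((Int, Int), Double)] =
  train.subtractByKey(test)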
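
The RDDs in the question hold flat (user_id, item_id, rating) triples, so they have to be keyed by (user_id, item_id) before subtractByKey can be applied. Here is a minimal sketch of that step, assuming Int ids, Double ratings and a local SparkContext. The object name, the Rating alias, the app name and the sample data are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CleanTrainExample {
  // Assumed record shape: (user_id, item_id, rating).
  type Rating = (Int, Int, Double)

  // Remove from train every record whose (user, item) pair also occurs in test.
  def cleanTrain(train: RDD[Rating], test: RDD[Rating]): RDD[Rating] = {
    // Re-key the flat triples so that (user, item) becomes the pair-RDD key.
    val keyedTrain = train.map { case (u, i, r) => ((u, i), r) }
    val keyedTest  = test.map { case (u, i, r) => ((u, i), r) }
    keyedTrain
      .subtractByKey(keyedTest)              // compares keys only, ratings are ignored
      .map { case ((u, i), r) => (u, i, r) } // back to the original triple shape
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("clean-train").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val train = sc.parallelize(Seq((1, 2, 5.0), (1, 4, 3.0), (1, 3, 4.0)))
      val test  = sc.parallelize(Seq((1, 2, 4.0), (1, 5, 2.0)))
      // User 1 / item 2 is in test, so it is dropped from train.
      cleanTrain(train, test).collect().sortBy(r => (r._1, r._2)).foreach(println)
    } finally sc.stop()
  }
}
```

Because subtractByKey matches on the key alone, the (1, 2) record is removed from train even though its rating differs between the two RDDs.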
