I have an RDD with the following structure:
((user_id, item_id, rating))
Let's call this RDD training.
Then there is another RDD with the same structure:
((user_id, item_id, rating))
Let's call this one test.
I want to make sure that data in test does not appear in training, on a per-user basis. For example:
training = {(u1, item2), (u1, item4), (u1, item3)}  test = {(u1, item2), (u1, item5)}
Here I want item2 removed from u1's training data.
What I started doing is grouping both RDDs by (user_id, item_id):
val groupedTrainData = trainData.groupBy(x => (x._1, x._2))
But I feel like this is not the way to go.
You need PairRDDFunctions.subtractByKey, which works on RDDs of key-value pairs, so both RDDs must be keyed by (user_id, item_id):
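import org.apache.spark.rdd.RDD

// String/String/Double are illustrative; substitute your own user, item, and rating types.
def cleanTrain(
    train: RDD[((String, String), Double)],
    test: RDD[((String, String), Double)]): RDD[((String, String), Double)] =
  train.subtractByKey(test) // keeps only training entries whose (user, item) key is absent from test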
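Since the RDDs in the question hold (user_id, item_id, rating) triples rather than key-value pairs, they need to be re-keyed before subtractByKey can be applied. A minimal sketch of that step, assuming String user and item ids and Double ratings (the helper name removeTestPairs and these types are hypothetical, not from the original post), suitable for pasting into spark-shell:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper: the name and the String/String/Double types are
// assumptions for illustration only.
def removeTestPairs(
    train: RDD[(String, String, Double)],
    test: RDD[(String, String, Double)]): RDD[(String, String, Double)] = {
  // Re-key each triple as ((user, item), rating) so the pair-RDD operations apply.
  val keyedTrain = train.map { case (user, item, rating) => ((user, item), rating) }
  val keyedTest = test.map { case (user, item, rating) => ((user, item), rating) }

  keyedTrain
    .subtractByKey(keyedTest) // drop every (user, item) key that also occurs in test
    .map { case ((user, item), rating) => (user, item, rating) } // back to triples
}
```

With the example from the question, the (u1, item2) entry would be dropped from the training data while (u1, item4) and (u1, item3) are kept. Only the keys are compared, so a pair is removed even if its rating differs between the two RDDs, and no explicit groupBy is needed.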