In Apache Spark, how can I group all the rows of an RDD by two shared values?

I have an RDD of a custom case class, where each row is of the form

{userId:"h245hv45uh", title: "The-BFG", seen: 1, timestamp: 2016-08-06 13:19:53.051000+0000}

Is there any way I can group all rows which have the same userId and title, and then create a single row in a new RDD with the same userId and title but with all the 'seen' values added?

{userId:"h245hv45uh", title: "The-BFG", seen: 71, timestamp: 2016-08-06 13:19:53.051000+0000}

like that ^ if there were 71 rows which had the same userId and title?

The original RDD has several titles and user IDs, and I'm trying to aggregate the 'seen' count by grouping rows with matching userIds and titles.

Thanks

Asked by Aaron O'Donnell on Jan 29 '26

1 Answer

You can try converting it into a pair RDD, then using reduceByKey:

// Assuming your case class looks something like the rows in the question:
case class CaseClass(userId: String, title: String, seen: Int, timestamp: String)

// Combine two rows that share a key by summing their 'seen' counts.
// All other fields (including the timestamp) are taken from the first row.
def combFunc(cc1: CaseClass, cc2: CaseClass): CaseClass = {
  cc1.copy(seen = cc1.seen + cc2.seen)
}

val newRDD = rdd
  .map(i => ((i.userId, i.title), i)) // key each row by (userId, title) -> a pair RDD
  .reduceByKey(combFunc)              // merge all rows that share a key
  .values                             // drop the keys, back to an RDD[CaseClass]
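
For completeness, here's a quick way to sanity-check this in the spark-shell (the sample rows and user IDs below are made up for illustration, using the assumed case class above):

val rdd = sc.parallelize(Seq(
  CaseClass("h245hv45uh", "The-BFG", 1,  "2016-08-06 13:19:53.051000+0000"),
  CaseClass("h245hv45uh", "The-BFG", 70, "2016-08-06 14:02:11.000000+0000"),
  CaseClass("k93jdu2",    "The-BFG", 3,  "2016-08-07 09:15:00.000000+0000")
))

val newRDD = rdd
  .map(i => ((i.userId, i.title), i))
  .reduceByKey(combFunc)
  .values

newRDD.collect().foreach(println)
// CaseClass(h245hv45uh,The-BFG,71,2016-08-06 13:19:53.051000+0000)
// CaseClass(k93jdu2,The-BFG,3,2016-08-07 09:15:00.000000+0000)

Two things worth knowing: reduceByKey combines values on each partition before shuffling, so it scales much better than groupByKey for this kind of aggregation; and since the order in which values are reduced is not guaranteed, which row's timestamp survives in the result is arbitrary.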
Answered by Daniel de Paula on Jan 31 '26


