Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Apache Spark - working with 2 RDDs: complement of RDDs

Tags:

apache-spark

I have 2 RDDs, each is read from a different type of log file that have some data in common. So, we have RDD of type EventA and RDD of type EventB, where classes EventA and EventB inherit from Event.

What is the best way to get an RDD of type EventB with distinct events with respect to RDD of type EventA?

Logically if I formulate the question in "set theory", I am interested in the complement of the sets: RDD[EventB] ∖ RDD[EventA]. I intend to use the equals method defined in Event to infer which events are the same.

like image 559
Jenny Avatar asked Aug 04 '14 10:08

Jenny


Video Answer


1 Answers

I think you want subtract or if the important data is in the keys subtractByKey.

basic usage:
rdd.subtractByKey(otherRdd)

This operation is much more efficient when the first RDD is smaller because the first RDD can be kept in memory while the second is streamed. From your question it wasn't clear if you wanted everything in A that isn't in B or everything that isn't in the intersection of A and B. So the solution for the second approach would be to union the result of the two subtractions:

val newRdd = rdd.subtractByKey(otherRdd).union(otherRdd.subtractByKey(rdd))
like image 53
aaronman Avatar answered Sep 20 '22 12:09

aaronman