 

Apache Spark difference between two RDDs

Say I have this example job (in Groovy w/ Java API):

def set1 = []
def set2 = []
0.upto(10) { set1 << it }
8.upto(20) { set2 << it }
def rdd1 = context.parallelize(set1)
def rdd2 = context.parallelize(set2)

//What next?

How do I get a set that is the delta between the two? I know that union creates an RDD containing all of the data in both RDDs, but how do I do the opposite of that?

Mike Thomsen asked May 30 '26 19:05


1 Answer

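If you just want set subtraction (the elements of rdd1 that are not in rdd2), subtract is the answer. If you want the "outer" collection, i.e. the symmetric difference (the elements that appear in exactly one of the two RDDs), try: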

rdd1.subtract(rdd2).union(rdd2.subtract(rdd1))
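
To check what that expression should return for the data in the question, here is a minimal plain-Java sketch that models it with ordinary lists and no Spark dependency. The class and method names (SymmetricDifferenceSketch, subtract, symmetricDifference) are hypothetical and only for illustration; the sketch assumes that subtract drops every element that also occurs in the other collection and that union simply concatenates, which is how I understand the RDD operations to behave.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SymmetricDifferenceSketch {

    // Models rdd.subtract(other): keep the elements of a that do not occur in b.
    public static List<Integer> subtract(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>(a);
        result.removeAll(b);
        return result;
    }

    // Models rdd1.subtract(rdd2).union(rdd2.subtract(rdd1)).
    public static List<Integer> symmetricDifference(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>(subtract(a, b));
        result.addAll(subtract(b, a)); // union is modeled as plain concatenation
        Collections.sort(result);      // sorted only for readable output; RDDs are unordered
        return result;
    }

    public static void main(String[] args) {
        // Same data as the question: 0..10 and 8..20, both inclusive.
        List<Integer> set1 = IntStream.rangeClosed(0, 10).boxed().collect(Collectors.toList());
        List<Integer> set2 = IntStream.rangeClosed(8, 20).boxed().collect(Collectors.toList());

        // prints [0, 1, 2, 3, 4, 5, 6, 7, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
        System.out.println(symmetricDifference(set1, set2));
    }
}
```

The overlapping values 8, 9 and 10 drop out and everything else stays. If the real RDDs can contain duplicates, the result may contain them too, so a distinct() on the result might be needed, depending on what you want.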
Dawid Wysakowicz answered Jun 02 '26 15:06