I'm trying to establish a cohort study to track in-app user behavior, and I'd like to know how to exclude from RDD 2 the elements that also appear in RDD 1. Given:
rdd1 = sc.parallelize([("a", "xoxo"), ("b", 4)])
rdd2 = sc.parallelize([("a", (2, "6play")), ("c", "bobo")])
For example, to get the common elements between rdd1 and rdd2, we just do:
rdd1.join(rdd2).map(lambda kv: (kv[0], kv[1][1])).collect()  # tuple unpacking in lambdas was removed in Python 3
Which gives :
[('a', (2, '6play'))]
So this join finds the elements common to rdd1 and rdd2 and keeps the key and values from rdd2 only. I want to do the opposite: find the elements that are in rdd2 but not in rdd1, again keeping the key and values from rdd2 only. In other words, I want the items from rdd2 whose keys aren't present in rdd1. So the expected output is:
("c", "bobo")
Ideas ? Thank you :)
Spark paired RDDs are simply RDDs whose elements are key-value pairs, whereas unpaired RDDs can hold objects of any type. Paired RDDs support a few special operations, such as distributed "shuffle" operations and grouping or aggregating elements by key.
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
RDD has been the primary user-facing API in Spark since its inception. At its core, an RDD is an immutable, distributed collection of your data's elements, partitioned across the nodes of your cluster, that can be operated on in parallel through a low-level API offering transformations and actions.
A paired RDD is a distributed collection of data in key-value form. It is a specialization of the Resilient Distributed Dataset, so it has all the features of an RDD plus additional operations for key-value pairs. Many transformation operations are available for paired RDDs.
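To give a feel for one such transformation, here is a plain-Python sketch of what reduceByKey does conceptually. The helper name reduce_by_key is ours for illustration; Spark's real implementation runs distributed across partitions.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    # Group all values sharing a key, then fold each group with func.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return [(key, reduce(func, values)) for key, values in grouped.items()]

# Example: sum the values per key
print(reduce_by_key([("a", 1), ("b", 4), ("a", 2)], lambda x, y: x + y))
# → [('a', 3), ('b', 4)]
```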
I just found the answer, and it's very simple!
rdd2.subtractByKey(rdd1).collect()
Enjoy :)
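For readers without a Spark shell handy, here is a plain-Python sketch of the semantics of subtractByKey, applied to the question's data. The subtract_by_key helper is illustrative only, not Spark's implementation.

```python
def subtract_by_key(rdd2_pairs, rdd1_pairs):
    # Keep only the pairs from rdd2 whose key does not occur in rdd1.
    excluded_keys = {key for key, _ in rdd1_pairs}
    return [pair for pair in rdd2_pairs if pair[0] not in excluded_keys]

rdd1_data = [("a", "xoxo"), ("b", 4)]
rdd2_data = [("a", (2, "6play")), ("c", "bobo")]
print(subtract_by_key(rdd2_data, rdd1_data))
# → [('c', 'bobo')]
```

Note that, like Spark's subtractByKey, the comparison is by key only: the values attached to "a" in the two datasets play no role in the exclusion.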