
How to get the difference between two RDDs in PySpark?

I'm trying to set up a cohort study to track in-app user behavior, and I want to ask if you have any idea how I can exclude elements from RDD 2 that are also in RDD 1. Given:

rdd1 = sc.parallelize([("a", "xoxo"), ("b", 4)])

rdd2 = sc.parallelize([("a", (2, "6play")), ("c", "bobo")])

For example, to get the common elements between rdd1 and rdd2, we just have to do:

rdd1.join(rdd2).map(lambda kv: (kv[0], kv[1][1])).collect()

Which gives :

[('a', (2, '6play'))]

So, this join finds the common elements between rdd1 and rdd2 and keeps the key and values from rdd2 only. I want to do the opposite: find the elements which are in rdd2 but not in rdd1, keeping the key and values from rdd2 only. In other words, I want to get the items from rdd2 which aren't present in rdd1. So the expected output is:

("c", "bobo")

Ideas? Thank you :)

Arij SEDIRI asked Nov 17 '16 14:11

People also ask

What is the difference between RDDs and paired RDDs?

Spark paired RDDs are RDDs containing key-value pairs. Unpaired RDDs consist of objects of any type, whereas paired (key-value) RDDs support a few special operations, such as distributed "shuffle" operations and grouping or aggregating the elements by key.
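The per-key aggregation that paired RDDs enable can be sketched in plain Python, no Spark needed (the pair data here is made up for illustration, and this mimics what `reduceByKey` with addition computes):

```python
from collections import defaultdict

# Plain-Python sketch of a paired-RDD aggregation such as
# reduceByKey(operator.add): values sharing a key are combined.
pairs = [("a", 1), ("b", 2), ("a", 3)]

totals = defaultdict(int)
for key, value in pairs:
    totals[key] += value

print(dict(totals))  # {'a': 4, 'b': 2}
```

In Spark the same combination happens per partition first and the partial results are then shuffled by key, which is why these operations are special to paired RDDs.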

What are the different ways to create RDDs with examples?

There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

What are RDDs in Pyspark?

RDD was the primary user-facing API in Spark since its inception. At its core, an RDD is an immutable distributed collection of elements of your data, partitioned across the nodes in your cluster, that can be operated on in parallel with a low-level API offering transformations and actions.

What is pair RDD in Pyspark?

A paired RDD is a distributed collection of data stored as key-value pairs. It is a specialization of the Resilient Distributed Dataset, so it has all the features of an RDD plus some additional ones for key-value pairs. Many transformation operations are available for paired RDDs.


1 Answer

I just found the answer, and it's very simple!

rdd2.subtractByKey(rdd1).collect()
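For readers without a Spark shell handy, the semantics of `subtractByKey` can be sketched in plain Python using the question's data:

```python
# Plain-Python sketch of rdd2.subtractByKey(rdd1): keep the pairs
# from rdd2 whose key does NOT appear in rdd1. Only keys are
# compared; the values play no role in the filtering.
rdd1_pairs = [("a", "xoxo"), ("b", 4)]
rdd2_pairs = [("a", (2, "6play")), ("c", "bobo")]

rdd1_keys = {key for key, _ in rdd1_pairs}
result = [(key, value) for key, value in rdd2_pairs if key not in rdd1_keys]

print(result)  # [('c', 'bobo')]
```

Note that `subtractByKey` compares keys only, whereas the plain `subtract` transformation would compare whole (key, value) pairs.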

Enjoy :)

Arij SEDIRI answered Dec 05 '22 11:12