I'm trying to establish a cohort study to track in-app user behavior, and I'd like to know how to exclude from RDD 2 the elements that also appear in RDD 1. Given:
rdd1 = sc.parallelize([("a", "xoxo"), ("b", 4)])
rdd2 = sc.parallelize([("a", (2, "6play")), ("c", "bobo")])
For example, to get the common elements between rdd1 and rdd2, we just do:
rdd1.join(rdd2).map(lambda kv: (kv[0], kv[1][1])).collect()  # tuple unpacking in lambdas was removed in Python 3
Which gives :
[('a', (2, '6play'))]
So this join finds the elements common to rdd1 and rdd2 and keeps the key and values from rdd2 only. I want to do the opposite: find the elements that are in rdd2 but not in rdd1, again keeping the key and values from rdd2 only. In other words, I want the items from rdd2 whose keys aren't present in rdd1. So the expected output is:
("c", "bobo")
Ideas ? Thank you :)
Spark paired RDDs are simply RDDs whose elements are key-value pairs, whereas unpaired RDDs can hold objects of any type. Paired RDDs support a few special operations, such as distributed "shuffle" operations and grouping or aggregating elements by key.
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
RDD has been the primary user-facing API in Spark since its inception. At its core, an RDD is an immutable, distributed collection of your data's elements, partitioned across the nodes of your cluster, that can be operated on in parallel through a low-level API offering transformations and actions.
A paired RDD is a distributed collection of data in key-value form. It is a specialization of the Resilient Distributed Dataset, so it has all the features of an RDD plus additional operations for key-value pairs. Many transformation operations are available for paired RDDs.
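To give a feel for one such transformation, here is a plain-Python sketch of what reduceByKey does conceptually. The helper name reduce_by_key is ours for illustration; Spark's real implementation runs distributed across partitions.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    # Group all values sharing a key, then fold each group with func.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return [(key, reduce(func, values)) for key, values in grouped.items()]

# Example: sum the values per key
print(reduce_by_key([("a", 1), ("b", 4), ("a", 2)], lambda x, y: x + y))
# → [('a', 3), ('b', 4)]
```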
I just found the answer, and it's very simple!
rdd2.subtractByKey(rdd1).collect()
Enjoy :)
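For readers without a Spark shell handy, here is a plain-Python sketch of the semantics of subtractByKey, applied to the question's data. The subtract_by_key helper is illustrative only, not Spark's implementation.

```python
def subtract_by_key(rdd2_pairs, rdd1_pairs):
    # Keep only the pairs from rdd2 whose key does not occur in rdd1.
    excluded_keys = {key for key, _ in rdd1_pairs}
    return [pair for pair in rdd2_pairs if pair[0] not in excluded_keys]

rdd1_data = [("a", "xoxo"), ("b", 4)]
rdd2_data = [("a", (2, "6play")), ("c", "bobo")]
print(subtract_by_key(rdd2_data, rdd1_data))
# → [('c', 'bobo')]
```

Note that, like Spark's subtractByKey, the comparison is by key only: the values attached to "a" in the two datasets play no role in the exclusion.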