How to join two RDDs in spark with python?

Suppose

rdd1 = ( (a, 1), (a, 2), (b, 1) ),
rdd2 = ( (a, ?), (a, *), (c, .) ).

Want to generate

( (a, (1, ?)), (a, (1, *)), (a, (2, ?)), (a, (2, *)) ).

Any easy methods? I think it is different from the cross join but can't find a good solution. My solution is

(rdd1
 .cartesian(rdd2)
 .filter(lambda kv: kv[0][0] == kv[1][0])
 .map(lambda kv: (kv[0][0], (kv[0][1], kv[1][1]))))

(Tuple unpacking in lambdas, e.g. lambda (k, v): ..., is Python 2 only; the version above also works in Python 3.)
Peng Sun asked Jun 22 '15 20:06

People also ask

How do I join RDD in Pyspark?

join(other, numPartitions=None) returns an RDD of pairs with matching keys, along with all the values for each such key. In the following example, there are two pairs of elements in two different RDDs. After joining these two RDDs, we get an RDD whose elements have matching keys and their values.

Which method is used to perform a right outer join between 2 pair RDDs?

rightOuterJoin(): Perform a right outer join of this and other. For each element (k, w) in other, the resulting RDD will either contain all pairs (k, (Some(v), w)) for v in this, or the pair (k, (None, w)) if no elements in this have key k.

How do you make a RDD pair?

Creating a pair RDD using the first word as the key in Java:

PairFunction<String, String, String> keyData =
    new PairFunction<String, String, String>() {
        public Tuple2<String, String> call(String x) {
            return new Tuple2(x.split(" ")[0], x);
        }
    };
JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);


1 Answer

You are just looking for a simple join, e.g.

rdd = sc.parallelize([("red",20),("red",30),("blue", 100)])
rdd2 = sc.parallelize([("red",40),("red",50),("yellow", 10000)])
rdd.join(rdd2).collect()
# Gives [('red', (20, 40)), ('red', (20, 50)), ('red', (30, 40)), ('red', (30, 50))]
dpeacock answered Oct 04 '22 13:10