
How to implement "Cross Join" in Spark?

We plan to move Apache Pig code to the new Spark platform.

Pig has a "Bag/Tuple/Field" concept and behaves similarly to a relational database. Pig provides support for CROSS/INNER/OUTER joins.

For CROSS JOIN, we can use:

    alias = CROSS alias, alias [, alias …] [PARTITION BY partitioner] [PARALLEL n];

But as we move to Spark, I couldn't find any counterpart in the Spark API. Do you have any idea?

asked Jul 21 '14 by Shawn Guo

People also ask

How does cross join work in Spark?

Cross join computes the Cartesian product of two tables: with m rows in one table and n rows in the other, the result has m*n rows. So even a small 10,000-row customer table joined with a 1,000-row products table explodes into 10,000,000 records!
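
For instance, a minimal Scala sketch of this blow-up (the table names and sample data are invented for illustration; runnable in spark-shell, where the spark session is predefined):

    import spark.implicits._

    val customers = Seq("alice", "bob", "carol").toDF("customer")  // 3 rows
    val products  = Seq("book", "pen").toDF("product")             // 2 rows

    // crossJoin pairs every customer with every product: 3 * 2 = 6 rows.
    val pairs = customers.crossJoin(products)
    pairs.show()
    pairs.count()  // 6

PySpark exposes the same crossJoin method on DataFrames.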

Can we perform cross join in DataFrame in Pyspark?

Spark DataFrames support all the basic SQL join types: INNER, LEFT OUTER, RIGHT OUTER, LEFT ANTI, LEFT SEMI, CROSS, and SELF JOIN.

How do you do a join in Spark?

Spark SQL supports several types of joins: inner join, cross join, left outer join, right outer join, full outer join, left semi join, and left anti join. Which join to use depends on the business use case; some joins are considerably more expensive in compute and memory than others.
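
A small Scala sketch of selecting the join type via the joinType string argument (the column names and sample data are made up; assumes spark-shell):

    import spark.implicits._

    val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
    val right = Seq((2, "x"), (3, "y")).toDF("id", "r")

    // The joinType string selects the join flavour.
    left.join(right, Seq("id"), "inner").show()       // id 2 only
    left.join(right, Seq("id"), "left_outer").show()  // ids 1 and 2
    left.join(right, Seq("id"), "full_outer").show()  // ids 1, 2 and 3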

What is Leftanti join in Spark?

An anti join returns the rows from the left relation that have no match in the right relation. It is also referred to as a left anti join.
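
A minimal Scala sketch of a left anti join (the orders/banned tables are hypothetical; assumes spark-shell):

    import spark.implicits._

    val orders = Seq(("alice", 1), ("bob", 2)).toDF("customer", "order_id")
    val banned = Seq("bob").toDF("customer")

    // left_anti keeps the left rows with no match on the right:
    // only alice's order survives.
    orders.join(banned, Seq("customer"), "left_anti").show()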


1 Answer

It is oneRDD.cartesian(anotherRDD).
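
For example, a minimal Scala sketch, runnable in spark-shell where sc is predefined (the RDD names and sample data are invented for illustration):

    val oneRDD     = sc.parallelize(Seq(1, 2, 3))
    val anotherRDD = sc.parallelize(Seq("a", "b"))

    // cartesian pairs every element of oneRDD with every element of
    // anotherRDD -- the RDD counterpart of Pig's CROSS.
    oneRDD.cartesian(anotherRDD).collect().foreach(println)
    // (1,a) (1,b) (2,a) (2,b) (3,a) (3,b)

Note that the result has size |oneRDD| * |anotherRDD|, so cartesian on large RDDs is expensive.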

answered Sep 21 '22 by Daniel Darabos