
How to implement "Cross Join" in Spark?

We plan to move Apache Pig code to the new Spark platform.

Pig has a "Bag/Tuple/Field" concept and behaves similarly to a relational database. Pig provides support for CROSS/INNER/OUTER joins.

For CROSS JOIN, we can use:

    alias = CROSS alias, alias [, alias …] [PARTITION BY partitioner] [PARALLEL n];

But as we move to Spark, I couldn't find any counterpart in the Spark API. Do you have any idea?

asked Jul 21 '14 by Shawn Guo

People also ask

How does cross join work in Spark?

Cross join computes the Cartesian product of two tables: with m rows in one table and n rows in the other, the result has m*n rows. So even a small 10,000-row customer table joined with a 1,000-row products table explodes into 10,000,000 records!
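
For instance, a minimal Scala sketch of this blow-up (the table names and sample data are invented for illustration; runnable in spark-shell, where the spark session is predefined):

    import spark.implicits._

    val customers = Seq("alice", "bob", "carol").toDF("customer")  // 3 rows
    val products  = Seq("book", "pen").toDF("product")             // 2 rows

    // crossJoin pairs every customer with every product: 3 * 2 = 6 rows.
    val pairs = customers.crossJoin(products)
    pairs.show()
    pairs.count()  // 6

PySpark exposes the same crossJoin method on DataFrames.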

Can we perform cross join in DataFrame in Pyspark?

Spark DataFrames support all the basic SQL join types: INNER, LEFT OUTER, RIGHT OUTER, LEFT ANTI, LEFT SEMI, CROSS, and SELF JOIN.

How do you do a join in Spark?

Spark SQL supports several types of joins: inner join, cross join, left outer join, right outer join, full outer join, left semi join, and left anti join. Which join to use depends on the business use case; some joins are considerably more expensive in compute and memory than others.
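
A small Scala sketch of selecting the join type via the joinType string argument (the column names and sample data are made up; assumes spark-shell):

    import spark.implicits._

    val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
    val right = Seq((2, "x"), (3, "y")).toDF("id", "r")

    // The joinType string selects the join flavour.
    left.join(right, Seq("id"), "inner").show()       // id 2 only
    left.join(right, Seq("id"), "left_outer").show()  // ids 1 and 2
    left.join(right, Seq("id"), "full_outer").show()  // ids 1, 2 and 3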

What is Leftanti join in Spark?

An anti join returns the rows from the left relation that have no match in the right relation. It is also referred to as a left anti join.
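
A minimal Scala sketch of a left anti join (the orders/banned tables are hypothetical; assumes spark-shell):

    import spark.implicits._

    val orders = Seq(("alice", 1), ("bob", 2)).toDF("customer", "order_id")
    val banned = Seq("bob").toDF("customer")

    // left_anti keeps the left rows with no match on the right:
    // only alice's order survives.
    orders.join(banned, Seq("customer"), "left_anti").show()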


1 Answer

It is oneRDD.cartesian(anotherRDD).
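
For example, a minimal Scala sketch, runnable in spark-shell where sc is predefined (the RDD names and sample data are invented for illustration):

    val oneRDD     = sc.parallelize(Seq(1, 2, 3))
    val anotherRDD = sc.parallelize(Seq("a", "b"))

    // cartesian pairs every element of oneRDD with every element of
    // anotherRDD -- the RDD counterpart of Pig's CROSS.
    oneRDD.cartesian(anotherRDD).collect().foreach(println)
    // (1,a) (1,b) (2,a) (2,b) (3,a) (3,b)

Note that the result has size |oneRDD| * |anotherRDD|, so cartesian on large RDDs is expensive.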

answered Sep 21 '22 by Daniel Darabos