How to CROSS JOIN 2 DataFrames?

I am struggling to get the CROSS JOIN of two DataFrames. I am using Spark 2.0. How can I implement a CROSS JOIN of two DataFrames?

Edit:

val df=df.join(df_t1, df("Col1")===df_t1("col")).join(df2,joinType=="cross join").where(df("col2")===df2("col2"))

Miruthan asked Feb 10 '17 11:02


4 Answers

Use crossJoin if no condition needs to be specified.

Here is an extract of working code:

people.crossJoin(area).show()

Ravish answered Nov 19 '22 00:11


Upgrade to spark-sql_2.11 version 2.1.0 or later and use the .crossJoin method of Dataset.

Nischay answered Nov 18 '22 23:11


Call join with the other dataframe without using a join condition.

Have a look at the following example. Given a first DataFrame of people:

+---+------+-----------------+------+
| id|  name|             mail|idArea|
+---+------+-----------------+------+
|  1|  Jack|[email protected]|     1|
|  2|Valery|[email protected]|     1|
|  3|  Karl|[email protected]|     2|
|  4|  Nick|[email protected]|     2|
|  5|  Luke|[email protected]|     3|
|  6| Marek|[email protected]|     3|
+---+------+-----------------+------+

and a second DataFrame of areas:

+------+--------------+
|idArea|      areaName|
+------+--------------+
|     1|Amministration|
|     2|        Public|
|     3|         Store|
+------+--------------+

the cross join is simply given by:

val cross = people.join(area)
cross.show()
+---+------+-----------------+------+------+--------------+
| id|  name|             mail|idArea|idArea|      areaName|
+---+------+-----------------+------+------+--------------+
|  1|  Jack|[email protected]|     1|     1|Amministration|
|  1|  Jack|[email protected]|     1|     3|         Store|
|  1|  Jack|[email protected]|     1|     2|        Public|
|  2|Valery|[email protected]|     1|     1|Amministration|
|  2|Valery|[email protected]|     1|     3|         Store|
|  2|Valery|[email protected]|     1|     2|        Public|
|  3|  Karl|[email protected]|     2|     1|Amministration|
|  3|  Karl|[email protected]|     2|     2|        Public|
|  3|  Karl|[email protected]|     2|     3|         Store|
|  4|  Nick|[email protected]|     2|     3|         Store|
|  4|  Nick|[email protected]|     2|     2|        Public|
|  4|  Nick|[email protected]|     2|     1|Amministration|
|  5|  Luke|[email protected]|     3|     2|        Public|
|  5|  Luke|[email protected]|     3|     3|         Store|
|  5|  Luke|[email protected]|     3|     1|Amministration|
|  6| Marek|[email protected]|     3|     1|Amministration|
|  6| Marek|[email protected]|     3|     2|        Public|
|  6| Marek|[email protected]|     3|     3|         Store|
+---+------+-----------------+------+------+--------------+
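
As a sanity check on the output above: a cross join is a Cartesian product, so 6 people and 3 areas should give 6 x 3 = 18 rows. Here is a minimal sketch in plain Python (no Spark involved) that reproduces the row count. The tuples are simplified from the tables above with the mail column left out, and cross_join is a hypothetical helper, not a Spark function.

```python
from itertools import product

# Simplified rows from the tables above: (id, name, idArea) and (idArea, areaName).
# The mail column is omitted; this is an illustration, not Spark code.
people = [
    (1, "Jack", 1), (2, "Valery", 1), (3, "Karl", 2),
    (4, "Nick", 2), (5, "Luke", 3), (6, "Marek", 3),
]
area = [(1, "Amministration"), (2, "Public"), (3, "Store")]


def cross_join(left, right):
    """Pair every left row with every right row (Cartesian product)."""
    return [l + r for l, r in product(left, right)]


cross = cross_join(people, area)
print(len(cross))  # 18
```

The result grows as the product of the two row counts, which is probably why Spark asks you to request a cross join explicitly.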

pheeleeppoo answered Nov 19 '22 01:11


You might have to enable cross joins in the Spark configuration. Example:

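from pyspark.sql import SparkSession

# Parentheses let the builder chain span several lines in Python.
spark = (
    SparkSession.builder
    .appName("distance_matrix")
    .config("spark.sql.crossJoin.enabled", True)
    .getOrCreate()
)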

and use something like this:

df1.join(df2, <condition>)
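
For what it's worth, the question's cross join followed by where(df("col2")===df2("col2")) should return the same rows as a join that takes the condition directly. A rough sketch in plain Python (not the Spark API) comparing the two; the sample rows and the helpers cross_then_filter and inner_join are made up for illustration.

```python
from itertools import product

# Hypothetical sample rows: (id, name, idArea) and (idArea, areaName).
people = [(1, "Jack", 1), (3, "Karl", 2), (5, "Luke", 3)]
area = [(1, "Amministration"), (2, "Public"), (3, "Store")]


def cross_then_filter(left, right, predicate):
    """Build every (left, right) pair, then keep only the pairs that match."""
    return [l + r for l, r in product(left, right) if predicate(l, r)]


def inner_join(left, right, left_key, right_key):
    """Index the right side by key so only matching pairs are ever built."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    return [l + r for l in left for r in index.get(l[left_key], [])]


filtered = cross_then_filter(people, area, lambda p, a: p[2] == a[0])
joined = inner_join(people, area, left_key=2, right_key=0)
print(filtered == joined)  # True
```

Both give the same rows, but the first version builds every pair before discarding most of them, so passing the condition to join is usually the better choice when you have one.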

nimish answered Nov 19 '22 01:11