 

spark join raises "Detected cartesian product for INNER join"


I have a DataFrame and, for each row, I want to add new_col = max(some_column) grouped by some other column:

from pyspark.sql.functions import max

maxs = df0.groupBy("catalog").agg(max("row_num").alias("max_num")).withColumnRenamed("catalog", "catalogid")
df0.join(maxs, df0.catalog == maxs.catalogid).take(4)

On the second line I get an error:

AnalysisException: u'Detected cartesian product for INNER join between logical plans\nProject ... Use the CROSS JOIN syntax to allow cartesian products between these relations.;'

What am I missing: why does Spark detect a cartesian product here?

One workaround I found: save the DataFrame to a Hive table, then re-create the DataFrame as a select from that table (or replace these two lines with a Hive query; it makes no difference). But I don't want to persist the DataFrame.

asked Feb 10 '17 by Alex Loo

2 Answers

As described in "Why does spark think this is a cross/cartesian join", it may be caused by the following:

This happens because you join structures sharing the same lineage and this leads to a trivially equal condition.

As for how the cartesian product is generated, you can refer to "Identifying and Eliminating the Dreaded Cartesian Product".

answered Sep 24 '22 by Frank.Chang


Try persisting the dataframes before joining them; it worked for me.

answered Sep 24 '22 by Utsav Bhatia