 

spark join raises "Detected cartesian product for INNER join"


I have a DataFrame and, for each row, I want to add new_col = max(some_column) grouped by some other column:

from pyspark.sql.functions import max

maxs = df0.groupBy("catalog").agg(max("row_num").alias("max_num")).withColumnRenamed("catalog", "catalogid")
df0.join(maxs, df0.catalog == maxs.catalogid).take(4)

On the second line I get an error:

AnalysisException: u'Detected cartesian product for INNER join between logical plans\nProject ... Use the CROSS JOIN syntax to allow cartesian products between these relations.;'

What am I missing: why does Spark detect a cartesian product here?

One workaround I found: save the DataFrame to a Hive table, then re-create the DataFrame as a select from that table (or replace these two lines with a Hive query; it makes no difference). But I don't want to persist the DataFrame.

asked Feb 10 '17 by Alex Loo

2 Answers

As described in "Why does spark think this is a cross/cartesian join", it may be caused by the following:

This happens because you join structures sharing the same lineage and this leads to a trivially equal condition.

As for how the cartesian product is generated, you can refer to "Identifying and Eliminating the Dreaded Cartesian Product".

answered Sep 24 '22 by Frank.Chang


Try persisting the dataframes before joining them; it worked for me.

answered Sep 24 '22 by Utsav Bhatia