I have a DataFrame and I want to add, for each row, new_col = max(some_column0) grouped by some other column1:
from pyspark.sql.functions import max  # Spark's max, not the Python builtin

maxs = df0.groupBy("catalog").agg(max("row_num").alias("max_num")).withColumnRenamed("catalog", "catalogid")
df0.join(maxs, df0.catalog == maxs.catalogid).take(4)
And on the second line I get an error:
AnalysisException: Detected cartesian product for INNER join between logical plans
Project ... Use the CROSS JOIN syntax to allow cartesian products between these relations.
What am I not understanding: why does Spark detect a cartesian product here?
A possible way to get rid of the error: save the DataFrame to a Hive table, then re-initialize the DataFrame by selecting from that table (or replace these two lines with a Hive query; it makes no difference). But I don't want to save the DataFrame.
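For reference, a minimal sketch of that round trip (assuming a SparkSession named spark and reusing the max import from above; the table name tmp_df0 is illustrative):

# Writing the DataFrame out and reading it back breaks the shared lineage:
# the re-read DataFrame is rooted at a table scan, not at df0's original plan.
df0.write.mode("overwrite").saveAsTable("tmp_df0")
df1 = spark.table("tmp_df0")

maxs = df1.groupBy("catalog").agg(max("row_num").alias("max_num")).withColumnRenamed("catalog", "catalogid")
df1.join(maxs, df1.catalog == maxs.catalogid).take(4)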
As described in Why does spark think this is a cross/cartesian join, it may be caused by:
This happens because you join structures sharing the same lineage and this leads to a trivially equal condition.
As for how the cartesian product is detected, you can refer to Identifying and Eliminating the Dreaded Cartesian Product.
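The error message itself also points at an escape hatch: explicitly allowing cartesian products. A sketch, not taken from the linked answers, using the standard Spark 2.x setting (whether this is acceptable depends on your data sizes, since a genuine cartesian product would then be allowed too):

# Disable the safety check that rejects plans classified as cartesian
# products (spark.sql.crossJoin.enabled, Spark 2.x).
spark.conf.set("spark.sql.crossJoin.enabled", "true")
df0.join(maxs, df0.catalog == maxs.catalogid).take(4)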
Try persisting the DataFrames before joining them. That worked for me.
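A minimal sketch of that suggestion (persisting maxs and forcing materialization before the join; the count() call is just one explicit way to trigger the cache):

# Once materialized, maxs is served from an InMemoryRelation, so the
# optimizer no longer sees the join as two branches of one shared lineage.
maxs = df0.groupBy("catalog").agg(max("row_num").alias("max_num")).withColumnRenamed("catalog", "catalogid").persist()
maxs.count()  # force the cache to be populated
df0.join(maxs, df0.catalog == maxs.catalogid).take(4)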