it appears that between spark 2.2.1 and spark 2.4.0, the behavior of left join with empty right dataframe changed from succeeding to returning "AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans".
for example:
val emptyDf = spark.emptyDataFrame
.withColumn("id", lit(0L))
.withColumn("brand", lit(""))
val nonemptyDf = ((1L, "a") :: Nil).toDF("id", "size")
val neje = nonemptyDf.join(emptyDf, Seq("id"), "left")
neje.show()
in 2.2.1, the result is
+---+----+-----+
| id|size|brand|
+---+----+-----+
| 1| a| null|
+---+----+-----+
however, in 2.4.0, i get the following exception:
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans
LocalRelation [id#278L, size#279]
and
Project [ AS brand#55]
+- LogicalRDD false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
here is the full plan explanation for the latter:
> neje.explain(true)
== Parsed Logical Plan ==
'Join UsingJoin(LeftOuter,List(id))
:- Project [_1#275L AS id#278L, _2#276 AS size#279]
: +- LocalRelation [_1#275L, _2#276]
+- Project [id#53L, AS brand#55]
+- Project [0 AS id#53L]
+- LogicalRDD false
== Analyzed Logical Plan ==
id: bigint, size: string, brand: string
Project [id#278L, size#279, brand#55]
+- Join LeftOuter, (id#278L = id#53L)
:- Project [_1#275L AS id#278L, _2#276 AS size#279]
: +- LocalRelation [_1#275L, _2#276]
+- Project [id#53L, AS brand#55]
+- Project [0 AS id#53L]
+- LogicalRDD false
== Optimized Logical Plan ==
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans
LocalRelation [id#278L, size#279]
and
Project [ AS brand#55]
+- LogicalRDD false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
== Physical Plan ==
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans
LocalRelation [id#278L, size#279]
and
Project [ AS brand#55]
+- LogicalRDD false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
additional observations:
is this a regression or by design? the earlier behavior seems more correct to me. i have not been able to find any relevant information in spark release notes, spark jira issues, or stackoverflow questions.
I didn't have quite your problem, but the same error at least, and I fixed it by explicitly allowing the cross-join:
spark.conf.set( "spark.sql.crossJoin.enabled" , "true" )
I have faced this issue multiple times. The recent one i remember was because i was using a dataframe at multiple actions, so it was recomputing every time. Once i cached it at source this error went away.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With