
Spark 2.4.0 gives "Detected implicit cartesian product" exception for left join with empty right DataFrame

It appears that between Spark 2.2.1 and Spark 2.4.0, the behavior of a left join with an empty right DataFrame changed from succeeding to throwing "AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans".

For example:

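import org.apache.spark.sql.functions.lit
import spark.implicits._  // needed for toDF; already in scope in spark-shell

// zero rows, with two literal columns added
val emptyDf = spark.emptyDataFrame
  .withColumn("id", lit(0L))
  .withColumn("brand", lit(""))
val nonemptyDf = ((1L, "a") :: Nil).toDF("id", "size")
// left join on "id" with the empty DataFrame on the right
val neje = nonemptyDf.join(emptyDf, Seq("id"), "left")
neje.show()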

In 2.2.1, the result is:

+---+----+-----+
| id|size|brand|
+---+----+-----+
|  1|   a| null|
+---+----+-----+

However, in 2.4.0, I get the following exception:

org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans
LocalRelation [id#278L, size#279]
and
Project [ AS brand#55]
+- LogicalRDD false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;

Here is the full plan explanation for the latter:

> neje.explain(true)

== Parsed Logical Plan ==
'Join UsingJoin(LeftOuter,List(id))
:- Project [_1#275L AS id#278L, _2#276 AS size#279]
:  +- LocalRelation [_1#275L, _2#276]
+- Project [id#53L,  AS brand#55]
   +- Project [0 AS id#53L]
      +- LogicalRDD false

== Analyzed Logical Plan ==
id: bigint, size: string, brand: string
Project [id#278L, size#279, brand#55]
+- Join LeftOuter, (id#278L = id#53L)
   :- Project [_1#275L AS id#278L, _2#276 AS size#279]
   :  +- LocalRelation [_1#275L, _2#276]
   +- Project [id#53L,  AS brand#55]
      +- Project [0 AS id#53L]
         +- LogicalRDD false

== Optimized Logical Plan ==
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans
LocalRelation [id#278L, size#279]
and
Project [ AS brand#55]
+- LogicalRDD false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
== Physical Plan ==
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans
LocalRelation [id#278L, size#279]
and
Project [ AS brand#55]
+- LogicalRDD false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;

Additional observations:

  • If only the left DataFrame is empty, the left join succeeds.
  • The same change in behavior occurs for a right join with an empty left DataFrame.
  • Interestingly, both versions fail with the AnalysisException for an inner join if both DataFrames are empty.
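
The cases above can be checked with a minimal sketch like the following. It assumes a local SparkSession and is meant to be pasted into spark-shell or run as a script; the helpers `emptyWith` and `check` are hypothetical names introduced here, not part of Spark. Each join is forced with `collect()` inside a `Try`, so the outcome is printed rather than aborting the run. No expected output is listed because it depends on the Spark version.

```scala
import scala.util.Try
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

// Assumption: a local session is good enough for reproducing the behavior.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("empty-join-check")
  .getOrCreate()
import spark.implicits._

// Hypothetical helper: zero rows, an "id" column and one more literal column.
def emptyWith(valueCol: String): DataFrame =
  spark.emptyDataFrame
    .withColumn("id", lit(0L))
    .withColumn(valueCol, lit(""))

val nonempty = Seq((1L, "a")).toDF("id", "size")

// Hypothetical helper: force execution and report success or the exception class.
def check(label: String)(df: => DataFrame): String = {
  val outcome = Try(df.collect())
    .map(_ => "ok")
    .recover { case e => e.getClass.getSimpleName }
    .get
  println(s"$label: $outcome")
  outcome
}

check("left join, empty right")(nonempty.join(emptyWith("brand"), Seq("id"), "left"))
check("left join, empty left")(emptyWith("brand").join(nonempty, Seq("id"), "left"))
check("right join, empty left")(emptyWith("brand").join(nonempty, Seq("id"), "right"))
check("inner join, both empty")(emptyWith("size").join(emptyWith("brand"), Seq("id"), "inner"))
```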

Is this a regression or by design? The earlier behavior seems more correct to me. I have not been able to find any relevant information in the Spark release notes, Spark JIRA issues, or Stack Overflow questions.

pedrito maynard-zhang asked May 27 '19 17:05



2 Answers

I didn't have quite your problem, but the same error at least, and I fixed it by explicitly allowing the cross-join:

spark.conf.set("spark.sql.crossJoin.enabled", "true")
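
Applied to the example from the question, the setting might be used as in the sketch below. The local SparkSession, the app name, and the variable `joined` are assumptions for illustration. Note that this setting turns off the cartesian-product check for the whole session, so a join that really is missing its condition would no longer be reported.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("cross-join-setting")
  .getOrCreate()
import spark.implicits._

// Allow implicit cartesian products for this session.
spark.conf.set("spark.sql.crossJoin.enabled", "true")

// Same DataFrames as in the question.
val emptyDf = spark.emptyDataFrame
  .withColumn("id", lit(0L))
  .withColumn("brand", lit(""))
val nonemptyDf = Seq((1L, "a")).toDF("id", "size")

val joined = nonemptyDf.join(emptyDf, Seq("id"), "left")
joined.show()
```

The same key can also be passed when the session is built, via `.config("spark.sql.crossJoin.enabled", "true")` on the builder.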
Paul answered Sep 18 '22 17:09



I have faced this issue multiple times. In the most recent case I remember, I was using the same DataFrame in multiple actions, so it was being recomputed every time. Once I cached it at the source, the error went away.
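
As a rough sketch of what caching at the source could look like for the DataFrames in the question (the local SparkSession and app name are assumptions, and whether this avoids the exception in a given Spark version is this answer's report, not something the sketch proves):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("cache-at-source")
  .getOrCreate()
import spark.implicits._

// Mark the DataFrame for caching where it is defined, before it is reused.
val emptyDf = spark.emptyDataFrame
  .withColumn("id", lit(0L))
  .withColumn("brand", lit(""))
  .cache()

val nonemptyDf = Seq((1L, "a")).toDF("id", "size")

// Later joins and actions reuse the cached DataFrame instead of its original plan.
val joined = nonemptyDf.join(emptyDf, Seq("id"), "left")
```

Calling `emptyDf.unpersist()` releases the cached data once it is no longer needed.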

Raj Hans answered Sep 20 '22 17:09
