I need to join two tables in PySpark, not on an exact match against the right table but on the nearest value (since there is no exact match).
It works fine in regular SQL, but does not work in Spark SQL. I am using Spark 2.2.1.
In regular SQL:
SELECT a.*,
(SELECT b.field2 FROM tableB b
WHERE b.field1 = a.field1
ORDER BY ABS(b.field2 - a.field2) LIMIT 1) as field2
FROM tableA a
ORDER BY a.field1
This works fine.
In Spark SQL:
...
tableA_DF.registerTempTable("tableA")
tableB_DF.registerTempTable("tableB")
query = "SELECT a.*, \
(SELECT b.field2 FROM tableB b \
WHERE b.field1 = a.field1 \
ORDER BY ABS(b.field2 - a.field2) LIMIT 1) field2 \
FROM tableA a \
ORDER BY a.field1"
result_DF = spark.sql(query)
I get the following exception:
pyspark.sql.utils.AnalysisException: u'Expressions referencing the outer query are not supported outside of WHERE/HAVING clauses
If Spark 2.2.1 does not support it, what would be the workaround?
Thank you in advance, Gary
Your best bet is to refactor the query so that field2 comes from a joined table rather than from a scalar correlated subquery. For example:
...
tableA_DF.registerTempTable("tableA")
tableB_DF.registerTempTable("tableB")
query = "SELECT a.*, \
FIRST(b.field2) OVER (ORDER BY ABS(b.field2 - a.field2)) field2 \
FROM tableA a \
JOIN tableB b \
ON a.field1 = b.field1 \
ORDER BY a.field1"
result_DF = spark.sql(query)
Catalyst is able to resolve the query plan when written in this manner. Note that the above was tested on Spark 2.3.1; some recently introduced windowing capabilities may be required for it to work.
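If staying in SQL is not a requirement, the same nearest-value lookup can also be written against the DataFrame API with a window function. The following is only a sketch, assuming the field1/field2 column names from the question; the b_field1, b_field2, rn, and field2_nearest names are placeholders I introduced, and if tableA can contain duplicate (field1, field2) pairs you would want to partition by a unique key of tableA instead.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Sketch only: assumes tableA_DF and tableB_DF have columns field1 and field2,
# as in the question. Rename tableB's columns to keep the join output unambiguous.
b = tableB_DF.select(
    F.col("field1").alias("b_field1"),
    F.col("field2").alias("b_field2"),
)

joined = tableA_DF.join(b, tableA_DF["field1"] == b["b_field1"], "left")

# Rank each candidate tableB row by its distance from the tableA value,
# per (field1, field2) pair from tableA, then keep only the closest one.
w = (Window
     .partitionBy("field1", "field2")
     .orderBy(F.abs(F.col("b_field2") - F.col("field2"))))

result_DF = (joined
             .withColumn("rn", F.row_number().over(w))
             .filter(F.col("rn") == 1)
             .drop("rn", "b_field1")
             .withColumnRenamed("b_field2", "field2_nearest")
             .orderBy("field1"))
As with the correlated subquery, rows of tableA that have no match in tableB come through with a null nearest value because of the left join.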
For anyone who may not have the luxury of redefining JOIN or WHERE clauses, Spark 2.4 may include some new functionality for correlated subquery processing: https://issues.apache.org/jira/browse/SPARK-18455
Update: unfortunately, the correlated subquery support tracked in SPARK-18455 slipped to target version 3.0.0 on 9/11/18. At this point it is very unlikely we'll see a 2.x release containing that feature, and historically, Spark minor features specified for an upcoming major release have tended to slip off the roadmap. I'd consider correlated subqueries mostly off the Spark team's roadmap for the moment.