I got a very simple query which shows significant performance difference when running on Spark SQL and Presto (3 hrs v.s 3 mins) in the same hardware.
SELECT field
FROM test1
WHERE field NOT IN (SELECT field FROM test2)
After some research of the query plan, I found out the reason is how Spark SQL deals with NOT IN
predicate subquery.
To correctly handle the NULL of NOT IN, Spark SQL translate the NOT IN
predicate as Left AntiJoin( (test1=test2) OR isNULL(test1=test2))
.
Spark SQL introduces OR isNULL(test1=test2)
to ensure the correct semantics of NOT IN
.
However, the OR
of Left AntiJoin join predicate causes the only feasible physical join strategy for Left AntiJoin
is BroadcastNestedLoopJoin
. For current stage, I could rewrite NOT IN to NOT EXISTS to workaround this issue. In the query plan of NOT EXISTS, I could see the the join predicate is Left AntiJoin(test1=test2)
which causes a better physical join operator for NOT EXISTS (5 mins to finish).
So far I am lucky since my dataset currently does not have any NULL
attributes, but it may have in the future and the semantics of NOT IN is what I really want.
So I check query plan of Presto, It does not really provides Left AntiJoin
but it uses SemiJoin
with a FilterPredicate = not (expr)
. The query plan of Presto does not provide too much info like Spark.
So my question is more like:
Could I assume Presto has a better physical join operator to handle NOT IN
operation? Not like Spark SQL, it does not rely on the rewrite of join predicates isnull(op1 = op2)
to ensure the correct semantics of NOT IN in the logical plan level.
Presto is more commonly used to support interactive SQL queries. Queries are usually analytical but can perform SQL-based ETL. Spark is more general in its applications, often used for data transformation and Machine Learning workloads.
Presto queries can generally run faster than Spark queries because Presto has no built-in fault-tolerance. Spark does support fault-tolerance and can recover data if there's a failure in the process, but actively planning for failure creates overhead that impacts Spark's query performance.
Presto on Spark runs as a library that is submitted with spark-submit or Jar Task on the Spark cluster.
Extrapolating the average I/O rate across the duration of the tests (Big SQL is 3.2x faster than Spark SQL), then Spark SQL actually reads almost 12x more data than Big SQL, and writes 30x more data.
I am actually the person who implemented NULL
treatment for semi join (IN
predicate) in Presto.
Presto uses "replicate nulls and any row" replication mode in addition to hash-partitioning¹, which allows it to process IN
correctly in the presence of NULL
s on either side of the IN
, without falling back to broadcasting, or making the execution single-threaded or single-node. The runtime performance cost is practically the same as if NULL
values didn't exist at all.
If you want to learn more about Presto internals, join the #dev
channel on Presto Community Slack.
¹) to be precise, semi join is hash-partitioned or broadcast, depending on cost-based decision or configuration.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With