Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

NOT IN implementation of Presto v.s Spark SQL

I got a very simple query which shows significant performance difference when running on Spark SQL and Presto (3 hrs v.s 3 mins) in the same hardware.

SELECT field 
FROM test1 
WHERE field NOT IN (SELECT field FROM test2)

After some research of the query plan, I found out the reason is how Spark SQL deals with NOT IN predicate subquery. To correctly handle the NULL of NOT IN, Spark SQL translate the NOT IN predicate as Left AntiJoin( (test1=test2) OR isNULL(test1=test2)).

Spark SQL introduces OR isNULL(test1=test2) to ensure the correct semantics of NOT IN.

However, the OR of Left AntiJoin join predicate causes the only feasible physical join strategy for Left AntiJoin is BroadcastNestedLoopJoin. For current stage, I could rewrite NOT IN to NOT EXISTS to workaround this issue. In the query plan of NOT EXISTS, I could see the the join predicate is Left AntiJoin(test1=test2) which causes a better physical join operator for NOT EXISTS (5 mins to finish).

So far I am lucky since my dataset currently does not have any NULL attributes, but it may have in the future and the semantics of NOT IN is what I really want.

So I check query plan of Presto, It does not really provides Left AntiJoin but it uses SemiJoin with a FilterPredicate = not (expr). The query plan of Presto does not provide too much info like Spark.

So my question is more like:

Could I assume Presto has a better physical join operator to handle NOT IN operation? Not like Spark SQL, it does not rely on the rewrite of join predicates isnull(op1 = op2) to ensure the correct semantics of NOT IN in the logical plan level.

like image 750
Bostonian Avatar asked Nov 06 '19 17:11

Bostonian


People also ask

How is Presto different from Spark?

Presto is more commonly used to support interactive SQL queries. Queries are usually analytical but can perform SQL-based ETL. Spark is more general in its applications, often used for data transformation and Machine Learning workloads.

Why is Presto faster than Spark SQL?

Presto queries can generally run faster than Spark queries because Presto has no built-in fault-tolerance. Spark does support fault-tolerance and can recover data if there's a failure in the process, but actively planning for failure creates overhead that impacts Spark's query performance.

Does Presto use Spark?

Presto on Spark runs as a library that is submitted with spark-submit or Jar Task on the Spark cluster.

Which is faster Spark or SQL?

Extrapolating the average I/O rate across the duration of the tests (Big SQL is 3.2x faster than Spark SQL), then Spark SQL actually reads almost 12x more data than Big SQL, and writes 30x more data.


1 Answers

I am actually the person who implemented NULL treatment for semi join (IN predicate) in Presto.

Presto uses "replicate nulls and any row" replication mode in addition to hash-partitioning¹, which allows it to process IN correctly in the presence of NULLs on either side of the IN, without falling back to broadcasting, or making the execution single-threaded or single-node. The runtime performance cost is practically the same as if NULL values didn't exist at all.

If you want to learn more about Presto internals, join the #dev channel on Presto Community Slack.

¹) to be precise, semi join is hash-partitioned or broadcast, depending on cost-based decision or configuration.

like image 183
Piotr Findeisen Avatar answered Sep 21 '22 14:09

Piotr Findeisen