NOT IN implementation of Presto vs. Spark SQL

I have a very simple query that shows a significant performance difference between Spark SQL and Presto (3 hours vs. 3 minutes) on the same hardware.

SELECT field 
FROM test1 
WHERE field NOT IN (SELECT field FROM test2)

After studying the query plan, I found that the cause is how Spark SQL handles the NOT IN predicate subquery. To handle NULLs correctly, Spark SQL translates the NOT IN predicate into Left AntiJoin((test1=test2) OR isNULL(test1=test2)).

Spark SQL introduces OR isNULL(test1=test2) to ensure the correct semantics of NOT IN.
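The NULL handling can be seen on a tiny dataset. The sketch below uses Python's built-in sqlite3 module instead of Spark, so it only illustrates standard SQL three-valued logic, not Spark's planner. The `run` helper and the sample rows are made up for illustration, and `NULL_AWARE_ANTI` is a hand-written SQL spelling of the anti-join predicate shown in the plan above.

```python
import sqlite3

# The original query shape.
NOT_IN = "SELECT field FROM test1 WHERE field NOT IN (SELECT field FROM test2)"

# The anti join written out by hand with the null-aware predicate:
# a test1 row is dropped if ANY test2 row makes (a = b) true or NULL.
NULL_AWARE_ANTI = """
SELECT field FROM test1
WHERE NOT EXISTS (
    SELECT 1 FROM test2
    WHERE test1.field = test2.field
       OR (test1.field = test2.field) IS NULL)
"""

def run(query, test2_rows):
    """Run `query` with test1 = (1, 2, NULL) and the given test2 rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE test1 (field INTEGER)")
    con.execute("CREATE TABLE test2 (field INTEGER)")
    con.executemany("INSERT INTO test1 VALUES (?)", [(1,), (2,), (None,)])
    con.executemany("INSERT INTO test2 VALUES (?)", [(v,) for v in test2_rows])
    rows = [r[0] for r in con.execute(query)]
    con.close()
    return rows

if __name__ == "__main__":
    print(run(NOT_IN, [2]))                 # [1]  (the NULL row in test1 is dropped)
    print(run(NOT_IN, [2, None]))           # []   (one NULL in test2 removes every row)
    print(run(NULL_AWARE_ANTI, [2]))        # [1]
    print(run(NULL_AWARE_ANTI, [2, None]))  # []
```

Both spellings return the same rows here, which is why the extra `OR ... IS NULL` term is needed for correctness even though it prevents a plain equi-join.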

However, the OR in the Left AntiJoin join predicate means that the only feasible physical join strategy is BroadcastNestedLoopJoin. For now, I can rewrite NOT IN as NOT EXISTS to work around this issue. In the query plan for NOT EXISTS, the join predicate is Left AntiJoin(test1=test2), which allows a better physical join operator (the query finishes in 5 minutes).

So far I am lucky, since my dataset currently has no NULL values in this field, but it may have some in the future, and the semantics of NOT IN are what I really want.
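To make the risk concrete, here is a small sketch (again with Python's sqlite3 rather than Spark, using made-up sample rows and a hypothetical `compare` helper) showing that the NOT EXISTS rewrite agrees with NOT IN only while neither side contains NULL.

```python
import sqlite3

NOT_IN = "SELECT field FROM test1 WHERE field NOT IN (SELECT field FROM test2)"

# The workaround: a plain equality anti join, with no NULL-aware term.
NOT_EXISTS = """
SELECT field FROM test1
WHERE NOT EXISTS (SELECT 1 FROM test2 WHERE test1.field = test2.field)
"""

def compare(test1_rows, test2_rows):
    """Return (NOT IN result, NOT EXISTS result) for the given sample rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE test1 (field INTEGER)")
    con.execute("CREATE TABLE test2 (field INTEGER)")
    con.executemany("INSERT INTO test1 VALUES (?)", [(v,) for v in test1_rows])
    con.executemany("INSERT INTO test2 VALUES (?)", [(v,) for v in test2_rows])
    not_in = [r[0] for r in con.execute(NOT_IN)]
    not_exists = [r[0] for r in con.execute(NOT_EXISTS)]
    con.close()
    return not_in, not_exists

if __name__ == "__main__":
    print(compare([1, 2, 3], [2]))        # ([1, 3], [1, 3])   no NULLs: same result
    print(compare([1, 2, 3], [2, None]))  # ([], [1, 3])       NULL in test2
    print(compare([1, None], [2]))        # ([1], [1, None])   NULL in test1
```

The two queries diverge as soon as a NULL appears on either side, so the rewrite is only safe while the column is known to be NULL-free.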

So I checked the Presto query plan. It does not provide a Left AntiJoin; instead, it uses a SemiJoin with FilterPredicate = not (expr). The Presto query plan does not give as much detail as Spark's.

So my question is:

Can I assume Presto has a better physical join operator for handling NOT IN? Unlike Spark SQL, it does not seem to rely on rewriting the join predicate with isnull(op1 = op2) at the logical plan level to ensure the correct semantics of NOT IN.


asked Nov 06 '19 17:11

Bostonian

1 Answer

I am actually the person who implemented NULL treatment for semi join (IN predicate) in Presto.

Presto uses "replicate nulls and any row" replication mode in addition to hash-partitioning¹, which allows it to process IN correctly in the presence of NULLs on either side of the IN, without falling back to broadcasting, or making the execution single-threaded or single-node. The runtime performance cost is practically the same as if NULL values didn't exist at all.
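The following toy model is only a sketch of the idea, not Presto's code. It assumes that "replicate nulls and any row" means: rows with NULL keys on the subquery (build) side are sent to every partition, and one arbitrary build row is also sent to every partition so that each partition can tell whether the build side is empty. All function names and the partition count are hypothetical.

```python
def in_semantics(x, build):
    """Three-valued `x IN (build)`: True, False, or None (SQL NULL)."""
    if not build:
        return False                 # IN over an empty set is FALSE, even for NULL
    if x is None:
        return None                  # NULL IN (non-empty set) is NULL
    if any(b == x for b in build if b is not None):
        return True
    return None if any(b is None for b in build) else False

def global_not_in(probe, build):
    """Reference result: keep rows where NOT (x IN build) is TRUE."""
    return [x for x in probe if in_semantics(x, build) is False]

def partitioned_not_in(probe, build, n=4):
    """Same result, but each probe row only looks at its own partition."""
    build_parts = [[] for _ in range(n)]
    for i, b in enumerate(build):
        if b is None or i == 0:
            # Replicate NULLs, plus one arbitrary row (here the first),
            # to every partition.
            for part in build_parts:
                part.append(b)
        else:
            build_parts[hash(b) % n].append(b)
    out = []
    for x in probe:
        # NULL probe keys may go to any partition; partition 0 is used here.
        p = 0 if x is None else hash(x) % n
        if in_semantics(x, build_parts[p]) is False:
            out.append(x)
    return out

if __name__ == "__main__":
    probe = [1, 2, 3, None, 10]
    print(partitioned_not_in(probe, [2, 7]))     # [1, 3, 10]
    print(partitioned_not_in(probe, [2, None]))  # []
    print(partitioned_not_in(probe, []))         # [1, 2, 3, None, 10]
```

In this model the replicated row can never cause a false match, because a probe key equal to it would hash to that row's own partition; its only effect elsewhere is to signal that the build side is non-empty.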

If you want to learn more about Presto internals, join the #dev channel on Presto Community Slack.

¹) to be precise, semi join is hash-partitioned or broadcast, depending on cost-based decision or configuration.


answered Sep 21 '22 14:09

Piotr Findeisen

Tags: null, apache-spark-sql, isnull, presto
