Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL - query inside NOT IN takes longer than the complete query?

Tags:

sql

oracle

I'm using NOT IN inside my SQL query.

For example:

select columnA 
from table1
where columnA not in (
select columnB
from table2)

How is it possible that this part of the query

select columnB
from table2

takes 30sec to complete, but the whole query above takes 0.1sec to complete?? Shouldn't the complete query take 30sec + ?

BTW, both queries return valid results.

Thanks!

Answers to Comments

Is it because the second query hasn't actually completed but has only returned back the first 'x' rows (out of a very large table?)

No, the query is completed after 30 seconds, not to many rows returned (eg. 50).

But @Aleksandar wondered why the question congaing the performance killer was so fast.

my point exactly

Also how long does select distinct columnB from table2 take to execute?

actually, the original query is "select distinct...

like image 694
Alex Avatar asked Jan 05 '11 13:01

Alex


People also ask

Why is MySQL query taking so long?

Queries can become slow for various reasons ranging from improper index usage to bugs in the storage engine itself. However, in most cases, queries become slow because developers or MySQL database administrators neglect to monitor them and keep an eye on their performance.

What is true about using not in when writing queries with sub queries in them?

When using NOT IN , the subquery returns a list of zero or more values in the outer query where the comparison column does not match any of the values returned from the subquery. Using the previous example, NOT IN returns all the products that are not supplied from MA.


3 Answers

It seems you are thinking that your main query implies the following steps:

(1)  Run the subquery
(2)  Check each row in table1 against the result set from the subquery.

Therefore, you think that running the subquery separately must take less time than running the whole query.

But SQL is not a procedural language, and the structure of the query does not necessarily imply the steps that will be followed to execute the query.

As Guffa answered, the optimizer will come up with (what it believes is) an optimal plan to execute each query. These execution plans are not always obvious from looking at the query, and in some cases can indeed be quite counter-intuitive.

I think that it is most likely, in this case, that the optimizer has come up with a quicker method for checking whether a value exists in table2 than simply querying all of table2 at once. It could be the transformation Guffa showed (although that still does not tell you the exact execution plan being used).

I would guess that table1 has significantly fewer rows than table2, and an index exists on table2.columnB. So all it has to do is fetch the rows from table1, then probe the index for each of those value to check for existence. But this is only one possibility.

Also, as Michael Buen pointed out, differences in the size of the result set returned can also impact your perceived performance. My intuition is that this is secondary to the execution plan differences, but it can be significant.

like image 104
Dave Costa Avatar answered Nov 15 '22 06:11

Dave Costa


It's because the query optimiser turns the query to something that looks completely different. The actual query should be the same as what's produced by a query like this:

select columnA 
from table1
left join table2 on ColumnA = ColumnB
where ColumnB is null

If the database can use indexes to join the tables, perhaps it doesn't have to query the entire table2, or even touch the table itself.

like image 25
Guffa Avatar answered Nov 15 '22 05:11

Guffa


A dramatic comparison, let's say this...

select columnB
from table2

...has billion rows (30 seconds), many data travels the wire and presented to user.

And this...

select columnA 
from table1

...has only one row.

RDBMS won't do a dumb job of pulling the data of table2 from server to client if you don't intend to display the data of table2. So no network bandwidth or I/O will be involved much when doing data presence testing, it all happens at the server, the only thing that would be pulled from server to client is the one row of table1 only.

select columnA 
from table1
where columnA not in (
select columnB
from table2)

And things shall be especially fast if your columnA and columnB happen to have an index

Things that would make a database operation slow are twofold: first is when you pull too much data from the server to client, second is when you don't have an index on pertinent fields

like image 29
Michael Buen Avatar answered Nov 15 '22 04:11

Michael Buen