I attend a database course at my school. The teacher gave us a simple exercise: consider the following, simple schema:
Table Book:
Column title (primary key)
Column genre (one of: "romance", "polar", ...)
Table Author:
Column title (foreign key on Book.title)
Column name
Primary key on (title, name)
Among the questions was the following one:
Write the query that returns the authors who have written romance books.
I proposed this answer:
select distinct name
from Author where title in (select title from Book where genre = "romance")
However the teacher said it was wrong, and that the correct answer was:
select distinct name
from Book, Author
where Book.title = Author.title
and genre = "romance"
When I asked for explanations all I got was a "if you had paid more attention to the course you would know why". Brilliant.
So, why is my answer incorrect? What exactly is the difference between these queries? What exactly do they do, on the DB engine level?
So, why is my answer incorrect?
Your answer is correct.
My guess is that the teacher marked it as wrong because he/she wanted to practise the use of joins with that question. But that should have been stated in the question if it was intended.
What exactly is the difference between these queries
Technically they are indeed different. A DBMS with a simple query optimizer will evaluate the sub-select differently from the join in your teacher's answer.
I wouldn't be surprised if a DBMS with a good optimizer actually came up with the same execution plan for both queries.
I created some test data with 50000 books, 50000 authors and 7 different genres to test (smaller numbers don't really make sense, as optimizers then tend to simply read the whole table). The statement returns 7144 rows.
In PostgreSQL, the execution plans are nearly identical, with a small difference in the "join" method.
Here is the plan for the sub-select version: http://explain.depesz.com/s/eov
Here is the plan for the join version: http://explain.depesz.com/s/aTI
Surprisingly, the join version has a slightly higher cost value.
In Oracle, both plans are 100% identical:
--------------------------------------------------------------------------------------
| Id  | Operation           | Name   | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |        |  6815 |   399K|       |   273   (2)| 00:00:04 |
|   1 |  HASH UNIQUE        |        |  6815 |   399K|   464K|   273   (2)| 00:00:04 |
|*  2 |   HASH JOIN         |        |  6815 |   399K|       |   172   (2)| 00:00:03 |
|*  3 |    TABLE ACCESS FULL| BOOK   |  6815 |   166K|       |    69   (2)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| AUTHOR | 50000 |  1708K|       |   103   (1)| 00:00:02 |
--------------------------------------------------------------------------------------
Looking at the statistics when using autotrace, there is also no difference whatsoever. I didn't bother to actually create a trace file to analyze it as I don't expect to see a difference there.
Things don't really change if an index on book.genre is added. Oracle sticks with the full table scan (even with 100000 rows). Probably because the tables are not very wide and a lot of rows fit on a single page.
PostgreSQL does use the index for both statements but there is still no real difference between the plans.
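For readers without PostgreSQL or Oracle at hand, a rough version of this comparison can be sketched with Python's built-in sqlite3 module. This is not the setup used above: the genre names other than "romance" and "polar", the title and author naming scheme, the index name and the one-author-per-book layout are all made up for illustration, and SQLite's planner may well behave differently from either engine.

```python
import sqlite3

# Hypothetical genre list; only "romance" and "polar" come from the question.
GENRES = ["romance", "polar", "fantasy", "history", "science", "travel", "poetry"]

SUBQUERY_SQL = """
    SELECT DISTINCT name
    FROM Author
    WHERE title IN (SELECT title FROM Book WHERE genre = 'romance')
"""
JOIN_SQL = """
    SELECT DISTINCT name
    FROM Book, Author
    WHERE Book.title = Author.title
      AND genre = 'romance'
"""

def build_db(n_books=50_000):
    """Create an in-memory database with synthetic books and authors."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Book (
            title TEXT PRIMARY KEY,
            genre TEXT NOT NULL
        );
        CREATE TABLE Author (
            title TEXT NOT NULL REFERENCES Book(title),
            name  TEXT NOT NULL,
            PRIMARY KEY (title, name)
        );
    """)
    conn.executemany(
        "INSERT INTO Book VALUES (?, ?)",
        ((f"book{i}", GENRES[i % len(GENRES)]) for i in range(n_books)),
    )
    # Assumption: exactly one author per book, to keep the data simple.
    conn.executemany(
        "INSERT INTO Author VALUES (?, ?)",
        ((f"book{i}", f"author{i}") for i in range(n_books)),
    )
    conn.execute("CREATE INDEX book_genre_idx ON Book(genre)")
    conn.execute("ANALYZE")  # gather statistics for the planner
    return conn

def plan(conn, sql):
    """Return the human-readable steps of SQLite's query plan."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

conn = build_db()
subquery_plan = plan(conn, SUBQUERY_SQL)
join_plan = plan(conn, JOIN_SQL)
subquery_count = len(conn.execute(SUBQUERY_SQL).fetchall())
join_count = len(conn.execute(JOIN_SQL).fetchall())

print("sub-select plan:", *subquery_plan, sep="\n  ")
print("join plan:", *join_plan, sep="\n  ")
print("same number of rows:", subquery_count == join_count)
```

The exact plan text depends on the SQLite version, so the sketch only prints the two plans side by side rather than asserting that they match.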
Both queries are valid and return the same result.
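A quick way to convince yourself of this is to run both queries against a tiny data set. The sketch below uses Python's built-in sqlite3 module; the book titles and author names are invented, and the string literal is written with single quotes, which is the standard SQL form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Book (
        title TEXT PRIMARY KEY,
        genre TEXT NOT NULL
    );
    CREATE TABLE Author (
        title TEXT NOT NULL REFERENCES Book(title),
        name  TEXT NOT NULL,
        PRIMARY KEY (title, name)
    );
    -- Made-up sample rows: Alice wrote two romance books, Bob co-wrote one
    -- romance book and one polar, Carol wrote only a polar.
    INSERT INTO Book VALUES
        ('Love A', 'romance'), ('Love B', 'romance'), ('Cold Case', 'polar');
    INSERT INTO Author VALUES
        ('Love A', 'Alice'), ('Love B', 'Alice'), ('Love B', 'Bob'),
        ('Cold Case', 'Carol'), ('Cold Case', 'Bob');
""")

# The student's version: sub-select with IN.
subquery_sql = """
    SELECT DISTINCT name
    FROM Author
    WHERE title IN (SELECT title FROM Book WHERE genre = 'romance')
"""
# The teacher's version: implicit (comma) join.
join_sql = """
    SELECT DISTINCT name
    FROM Book, Author
    WHERE Book.title = Author.title
      AND genre = 'romance'
"""

subquery_names = sorted(row[0] for row in conn.execute(subquery_sql))
join_names = sorted(row[0] for row in conn.execute(join_sql))

print(subquery_names)  # ['Alice', 'Bob']
print(join_names)      # ['Alice', 'Bob']
```

Alice appears only once in both results even though she wrote two romance books, because both queries use DISTINCT.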
Your teacher uses quite outdated (though still valid) join syntax, and you are using the construct which is less efficient in some databases (MySQL, for instance).
If I were your teacher, I would write the query as this:
SELECT DISTINCT name
FROM Book b
JOIN Author a
ON a.title = b.title
WHERE b.genre = 'romance'
but still accept both your and your teacher's queries, if the course was not specific to MySQL optimization.
Could that be what the teacher meant by the remark about paying attention?
Update:
On the DB engine level both queries would be optimized to use the same plan, except if the DB engine is MySQL.
In MySQL, your query would be forced to use Author as the leading table, while for your teacher's query, the optimizer can choose which table to make leading depending on the table statistics.