SQL UNION ALL to eliminate duplicates

I found this sample interview question and answer posted on Toptal, reproduced here. But I don't really understand the code. How can a UNION ALL turn into a UNION (distinct) like that? Also, why is this code faster?

QUESTION

Write a SQL query using UNION ALL (not UNION) that uses the WHERE clause to eliminate duplicates. Why might you want to do this?

ANSWER

You can avoid duplicates using UNION ALL and still run much faster than UNION DISTINCT (which is actually the same as UNION) by running a query like this:

SELECT * FROM mytable WHERE a=X
UNION ALL
SELECT * FROM mytable WHERE b=Y AND a!=X

The key is the AND a!=X part. This gives you the benefits of the UNION (a.k.a., UNION DISTINCT) command, while avoiding much of its performance hit.

asked Jan 18 '17 20:01

user3685285

1 Answer

But in the example, the first query has a condition on column a, whereas the second query has a condition on column b. This probably came from a query that's hard to optimize:

SELECT * FROM mytable WHERE a=X OR b=Y

This query is hard to optimize with simple B-tree indexing. Should the engine search an index on column a, or one on column b? If it uses only one index, then whichever it picks, finding the rows that match the other term requires a table scan.

Hence the trick of using UNION to split the search into two queries, one for each term. Each subquery can use the best index for its own search term. Then combine the results using UNION.

But the two subsets may overlap, because some rows where b=Y may also have a=X, in which case those rows occur in both subsets. Therefore you have to eliminate duplicates, or else you will see some rows twice in the final result.

SELECT * FROM mytable WHERE a=X 
UNION DISTINCT
SELECT * FROM mytable WHERE b=Y

UNION DISTINCT is expensive because typical implementations sort the rows to find duplicates, just as they do when you use SELECT DISTINCT.
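To make the overlap concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout (id, a, b), the sample rows, and the values X=1 and Y=2 are made up for illustration; SQLite is only a convenient stand-in, and it spells UNION DISTINCT as plain UNION.

```python
import sqlite3

# Hypothetical sample data: row 3 matches both a=X and b=Y.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO mytable (id, a, b) VALUES (?, ?, ?)",
    [(1, 1, 0), (2, 0, 2), (3, 1, 2), (4, 0, 0)],
)
X, Y = 1, 2

# UNION ALL simply concatenates the two result sets, so row 3 shows up twice.
union_all = conn.execute(
    "SELECT * FROM mytable WHERE a = ? "
    "UNION ALL "
    "SELECT * FROM mytable WHERE b = ?",
    (X, Y),
).fetchall()

# UNION (the same as UNION DISTINCT) removes the duplicate, at the cost of extra work.
union_distinct = conn.execute(
    "SELECT * FROM mytable WHERE a = ? "
    "UNION "
    "SELECT * FROM mytable WHERE b = ?",
    (X, Y),
).fetchall()

print(sorted(union_all))
print(sorted(union_distinct))
```

With this data the UNION ALL result has four rows, including row 3 twice, while the UNION result has three.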

It also feels like even more "wasted" work if the two subsets of rows you are unioning have many rows in common. That's a lot of rows to eliminate.

But there's no need to eliminate duplicates if you can guarantee that the two sets of rows are already distinct, that is, if you guarantee there is no overlap. In that case duplicate elimination would always be a no-op, so the query can skip that step and with it the costly sorting.

If you change the queries so that they are guaranteed to select non-overlapping subsets of rows, that's a win.

SELECT * FROM mytable WHERE a=X 
UNION ALL 
SELECT * FROM mytable WHERE b=Y AND a!=X

These two sets are guaranteed to have no overlap. If the first set has rows where a=X and the second set has rows where a!=X then there can be no row that is in both sets.

The second query therefore only catches some of the rows where b=Y, but any row where a=X AND b=Y is already included in the first set.
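As a quick check of that reasoning, the following sketch (again Python's sqlite3, with a hypothetical table layout, made-up rows, and X=1, Y=2) compares the rewritten UNION ALL query against the original OR query. Both columns are declared NOT NULL here, which is an assumption of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable ("
    "id INTEGER PRIMARY KEY, a INTEGER NOT NULL, b INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO mytable (id, a, b) VALUES (?, ?, ?)",
    [(1, 1, 0), (2, 0, 2), (3, 1, 2), (4, 0, 0)],
)
X, Y = 1, 2

# The original query with two OR terms.
or_rows = conn.execute(
    "SELECT * FROM mytable WHERE a = ? OR b = ?", (X, Y)
).fetchall()

# The rewrite: the second branch excludes rows the first branch already returned.
rewritten = conn.execute(
    "SELECT * FROM mytable WHERE a = ? "
    "UNION ALL "
    "SELECT * FROM mytable WHERE b = ? AND a != ?",
    (X, Y, X),
).fetchall()

print(sorted(or_rows))
print(sorted(rewritten))
```

Row 3, which matches both conditions, comes only from the first branch, so the two printed lists are the same and contain no repeated row.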

So the query achieves an optimized search for two OR terms without producing duplicates and without requiring a UNION DISTINCT operation.
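One caveat worth checking in your own schema: if column a is nullable, a!=X evaluates to NULL (not true) for rows where a IS NULL, so the second branch silently drops rows where b=Y and a is NULL, even though the OR query returns them. The sketch below uses Python's sqlite3 with made-up data to show the difference and one possible null-safe variant; the ids helper and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO mytable (id, a, b) VALUES (?, ?, ?)",
    [(1, 1, 0), (2, 0, 2), (3, 1, 2), (5, None, 2)],  # row 5 has a NULL in column a
)
X, Y = 1, 2


def ids(sql, params):
    """Hypothetical helper: run a query and return the sorted id column."""
    return sorted(row[0] for row in conn.execute(sql, params))


or_ids = ids("SELECT * FROM mytable WHERE a = ? OR b = ?", (X, Y))

# The rewrite as written: NULL != X is NULL, so row 5 is filtered out.
naive_ids = ids(
    "SELECT * FROM mytable WHERE a = ? "
    "UNION ALL "
    "SELECT * FROM mytable WHERE b = ? AND a != ?",
    (X, Y, X),
)

# A null-safe variant: explicitly keep rows where a IS NULL.
null_safe_ids = ids(
    "SELECT * FROM mytable WHERE a = ? "
    "UNION ALL "
    "SELECT * FROM mytable WHERE b = ? AND (a != ? OR a IS NULL)",
    (X, Y, X),
)

print(or_ids, naive_ids, null_safe_ids)
```

If a is declared NOT NULL, the extra condition is unnecessary and the original rewrite is already equivalent to the OR query.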

answered Sep 27 '22 21:09

Bill Karwin


Tags: sql, sql-server, mysql, union, union-all
