Is there any difference between these two performance-wise?
-- eliminate duplicates using UNION
SELECT col1,col2,col3 FROM Table1
UNION SELECT col1,col2,col3 FROM Table2
UNION SELECT col1,col2,col3 FROM Table3
UNION SELECT col1,col2,col3 FROM Table4
UNION SELECT col1,col2,col3 FROM Table5
UNION SELECT col1,col2,col3 FROM Table6
UNION SELECT col1,col2,col3 FROM Table7
UNION SELECT col1,col2,col3 FROM Table8
-- eliminate duplicates using DISTINCT
SELECT DISTINCT * FROM
(
SELECT col1,col2,col3 FROM Table1
UNION ALL SELECT col1,col2,col3 FROM Table2
UNION ALL SELECT col1,col2,col3 FROM Table3
UNION ALL SELECT col1,col2,col3 FROM Table4
UNION ALL SELECT col1,col2,col3 FROM Table5
UNION ALL SELECT col1,col2,col3 FROM Table6
UNION ALL SELECT col1,col2,col3 FROM Table7
UNION ALL SELECT col1,col2,col3 FROM Table8
) x
UNION retrieves only distinct records from all queries or tables, whereas UNION ALL returns all the records retrieved by queries. Performance of UNION ALL is higher than UNION.
UNION ALL does return duplicates: this results in a faster query and could be useful for those who want to know what is in both SELECT statements. UNION DISTINCT is used in scenarios when we need unique based on the where conditions in the query.
DISTINCT is used to filter unique records out of all records in the table. It removes the duplicate rows. SELECT DISTINCT will always be the same, or faster than a GROUP BY.
Yes, the application needs to compare every record to the "distinct" records cache as it goes. You can improve performance by using an index, particularly on the numeric and date fields.
The difference between Union and Union all is that UNION ALL
will not eliminate duplicate rows, instead it just pulls all rows from all tables fitting your query specifics and combines them into a table.
A UNION statement effectively does a SELECT DISTINCT
on the results set.
If you select Distinct from Union All result set, Then the output will be equal to the Union result set.
Edit:
Performance on CPU cost:
Let me explain with Example:
I have two queries. one is Union another one is Union All
SET STATISTICS TIME ON
GO
select distinct * from (select * from dbo.user_LogTime
union all
select * from dbo.user_LogTime) X
GO
SET STATISTICS TIME OFF
SET STATISTICS TIME ON
GO
select * from dbo.user_LogTime
union
select * from dbo.user_LogTime
GO
SET STATISTICS TIME OFF
I did run the both in same query window in SMSS. Lets see the Execution Plan in SMSS:
What happens is, The query with Union All and Distinct will take CPU cost more than Query with Union.
Performance on Time:
UNION ALL
:
(1172 row(s) affected)
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 39 ms.
UNION
:
(1172 row(s) affected)
SQL Server Execution Times:
CPU time = 10 ms, elapsed time = 25 ms.
So Union is much better than the Union All with Distinct in performance-wise
This query is used to create an extended employee table with additional alternate IDs for downstream system. This example comes from a mySQL 8.0.20 enviornment.
For the data and query shown below, tests yielded a significant difference:
UNION ALL 8.983 sec
UNION DISTINCT 15.344 sec
To show scale and complexity for this example, table sizes and the query code are shown below
hqsource 600K records
accountingemppos 180K
accountingposld 200K
emp_no_accountingnumeric 20
First UNION block is approx 550K records, second approx 50K
SELECT a.`emp_no_imported` AS `emp_no`,
a.`supervisor_emp_no`,
a.`first name`,
a.`middle name`,
a.`last name`,
a.`jobtitle`,
a.`status`,
CASE WHEN rida.`accounting_emp_no` IS NOT NULL THEN
rida.`accounting_emp_no`
ELSE
a.`emp_no_imported`
END AS `accounting_id`,
CASE WHEN epfp.`emp_no` IS NOT NULL THEN
CASE WHEN `sridf`.`emp_no` IS NOT NULL THEN
`sridf`.`accounting_emp_no`
ELSE
epfp.`emp_no`
END
ELSE
CASE WHEN epp.`emp_no` IS NOT NULL THEN
CASE WHEN `srids`.`emp_no` IS NOT NULL THEN
`srids`.`accounting_emp_no`
ELSE
epp.`emp_no`
END
ELSE
CASE WHEN `srida`.`emp_no` IS NOT NULL THEN
`srida`.`accounting_emp_no`
ELSE
a.`supervisor_emp_no`
END
END
END AS `accounting_s_emp_no`,
ep.`emp_no` AS `traas_emp_no`,
epp.`emp_no` AS `traas_parent_emp_no`
FROM `hqsource`.hq_people a
LEFT OUTER JOIN `hqsource`.`emp_no_accountingnumeric` `rida` ON `rida`.emp_no = a.`emp_no_imported`
LEFT OUTER JOIN `hqsource`.`emp_no_accountingnumeric` `srida` ON `srida`.emp_no = a.`supervisor_emp_no`
LEFT OUTER JOIN `traas`.`accountingemppos_data_extract` ep ON ep.`emp_no` = a.`emp_no_imported` AND ep.`End` = '2899-12-31' AND ep.`Primary` = 'Y'
LEFT OUTER JOIN `epe`.`accountingposld_data_extract` p ON p.`RangeGID` = ep.`GID`
LEFT OUTER JOIN `traas`.`accountingemppos_data_extract` epp ON epp.`GID` = p.`ParentGID` AND epp.`End` = '2899-12-31' AND epp.`Primary` = 'Y'
LEFT OUTER JOIN `hqsource`.`emp_no_accountingnumeric` `rids` ON `rids`.emp_no = ep.`emp_no`
LEFT OUTER JOIN `hqsource`.`emp_no_accountingnumeric` `srids` ON `srids`.emp_no = epp.`emp_no` AND epp.`End` = '2899-12-31' AND epp.`Primary` = 'Y'
LEFT OUTER JOIN `epe`.`accountingemppos_data_extract_filtered` epf ON epf.`emp_no` = a.`emp_no_imported` AND epf.`End` = '2899-12-31' AND epf.`Primary` = 'Y'
LEFT OUTER JOIN `epe`.`accountingposld_data_extract` pf ON pf.`RangeGID` = epf.`GID`
LEFT OUTER JOIN `epe`.`accountingemppos_data_extract_filtered` epfp ON epfp.`GID` = pf.`ParentGID` AND epfp.`End` = '2899-12-31' AND epfp.`Primary` = 'Y'
LEFT OUTER JOIN `hqsource`.`emp_no_accountingnumeric` `ridf` ON `ridf`.emp_no = epf.`emp_no` AND epf.`End` = '2899-12-31' AND epf.`Primary` = 'Y'
LEFT OUTER JOIN `hqsource`.`emp_no_accountingnumeric` `sridf` ON `sridf`.emp_no = epfp.`emp_no` AND epfp.`End` = '2899-12-31' AND epfp.`Primary` = 'Y'
WHERE a.`emp_no_imported` REGEXP ('^[a-z]{2}\\d{5}.$')
UNION ALL
-- UNION DISTINCT
SELECT a.`emp_no_imported` AS `emp_no`, a.`supervisor_emp_no` AS `s_emp_no`, u.`First_Name`, 'ƒ' AS `MI`, u.`Last_Name`, u.`Job_Title`, NULL AS `status`,
CASE WHEN rid.`accounting_emp_no` IS NULL THEN
ep.`emp_no`
ELSE
rid.`accounting_emp_no`
END AS `accounting_emp_no`,
CASE WHEN `srid`.`accounting_emp_no` IS NULL THEN
epp.`emp_no`
ELSE
`srid`.`accounting_emp_no`
END AS `accounting_s_emp_no`,
ep.`emp_no` AS `traas_emp_no`,
epp.`emp_no` AS `traas_parent_emp_no`
FROM `epe`.`accountingemppos_data_extract_filtered` ep
LEFT OUTER JOIN `hqsource`.`hq_people` a ON a.`emp_no_imported` = ep.`emp_no`
LEFT OUTER JOIN `epe`.`accountingposld_data_extract` p ON p.`RangeGID` = ep.`GID`
LEFT OUTER JOIN `epe`.`accountingemppos_data_extract_filtered` epp ON epp.`GID` = p.`ParentGID`
LEFT OUTER JOIN `siebel`.`users_all_output` u ON u.`LOGIN` = ep.`emp_no`
LEFT OUTER JOIN `hqsource`.`emp_no_accountingnumeric` `rid` ON `rid`.emp_no = ep.`emp_no`
LEFT OUTER JOIN `hqsource`.`emp_no_accountingnumeric` `srid` ON `srid`.emp_no = epp.`emp_no`
WHERE
ep.`End` = '2899-12-31' AND
epp.`End` = '2899-12-31' AND
p.`End` = '2899-12-31' AND
ep.emp_no REGEXP ('^F\\d{8}$|^V[0-3]\\d{5}$')
ORDER BY LENGTH(accounting_emp_no) ASC, accounting_emp_no ASC
;
The WHERE clauses in each of the two UNION blocks guarantee the results will be unique. (This query is years old and runs daily. I wish I had tried this sooner). Field names have been obfuscated
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With