I have two versions of the same dataset and I need to do a full join to find records missing in one of them, both have some missing records. I've managed to come up with two ways of doing it but both have disadvantages. My dataset sizes and filter conditions are very big.
Solution 1 has a drawback of using CTE, which will split the filters and make the code much harder to read, I would like to have just one query:
create table #temp (id int, vers nvarchar(1))
insert into #temp select 1,'a' union select 2,'a' union select 3,'a'
union select 1,'b' union select 2,'b' union select 100,'b'
;WITH vers_a as (SELECT * FROM #temp WHERE vers = 'a')
,vers_b as (SELECT * FROM #temp WHERE vers = 'b')
SELECT ta.id, tb.id, ta.vers, tb.vers
FROM vers_a ta
FULL JOIN vers_b tb on ta.id = tb.id
WHERE ta.id is null or tb.id is null
drop table #temp
Solution 2 duplicates the filters and the execution plan is bigger:
create table #temp (id int, vers nvarchar(1))
insert into #temp select 1,'a' union select 2,'a' union select 3,'a'
union select 1,'b' union select 2,'b' union select 100,'b'
SELECT ta.id, tb.id, ta.vers, tb.vers
FROM #temp ta
FULL JOIN #temp tb on ta.id = tb.id and ta.vers = 'a' and tb.vers = 'b'
WHERE (ta.id is null or tb.id is null) and (ta.vers = 'a' or tb.vers = 'b')
drop table #temp
So my question is, is it possible to have something like solution 2 but without the double condition definitions and with smaller execution plan like in solution 1?
Edit: When running both solutions in one query I can see that solution 2 cots 26% and solution 1 costs 45%, though it has smaller execution plan. I would like the faster solution (not necessarily with smaller execution plan like I said in the question) without code duplication if possible.
Edit2: Sorry for misleading 1st edit, I'm not good at optimizing :) I tested this on a ~1.5mil rows set and solution 1 is faster, to get the set used this:
create table #temp (id int, vers nvarchar(1))
insert into #temp select 1,'a' union select 2,'a' union select 3,'a'
union select 1,'b' union select 2,'b' union select 100,'b'
while (select count(*) from #temp) < 1000000
begin
insert into #temp select id+ABS(CHECKSUM(NewId()))%10000, vers from #temp
end
This should have a good plan. Index on vers may help.
SELECT ta.id, tb.id, ta.vers, tb.vers
FROM (SELECT * FROM #temp WHERE vers = 'a') ta
FULL JOIN (SELECT * FROM #temp WHERE vers = 'b') tb on ta.id = tb.id
WHERE (ta.id is null or tb.id is null)
EDIT Made some tests. The above query has better CPU then 2 other versions.
-- SETUP
drop table temp;
go
create table temp (
id int
,vers nvarchar(1));
insert temp(id,vers)
select top(100000)
row_number() over(order by (select null)) / 2
, case ABS(CHECKSUM(NewId())) % 2 when 0 then 'a' else 'b' end
from sys.all_objects t, sys.all_objects t1 ;
create index idx_temp_vers on temp(vers) include(id)
with
fillfactor=90;
select top(50) *
from temp;
-- TEST RUNS
SET STATISTICS TIME ON;
print ' 1 index query 1 '
SELECT ta.id, tb.id, ta.vers, tb.vers
FROM (SELECT * FROM temp WHERE vers = 'a') ta
FULL JOIN (SELECT * FROM temp WHERE vers = 'b') tb on ta.id = tb.id
WHERE (ta.id is null or tb.id is null)
;
print ' 1 index query 2 '
SELECT ta.id, tb.id, ta.vers, tb.vers
FROM temp ta
FULL JOIN temp tb on ta.id = tb.id and ta.vers = 'a' and tb.vers = 'b'
WHERE (ta.id is null or tb.id is null) and (ta.vers = 'a' or tb.vers = 'b')
;
print ' 1 index query 3 '
SELECT ta.id, TA.vers
from temp ta
where ta.vers = 'a'
and TA.id NOT IN(SELECT tb.id FROM temp tb WHERE tb.vers = 'b')
UNION ALL
SELECT tb.id, Tb.vers
from temp tb
where tb.vers = 'b'
and Tb.id NOT IN(SELECT ta.id FROM temp ta WHERE ta.vers = 'a')
-- One more index
create index idx_temp_id on temp(id)
with
fillfactor=90;
print ' 2 indexes query 1 '
SELECT ta.id, tb.id, ta.vers, tb.vers
FROM (SELECT * FROM temp WHERE vers = 'a') ta
FULL JOIN (SELECT * FROM temp WHERE vers = 'b') tb on ta.id = tb.id
WHERE (ta.id is null or tb.id is null)
;
print ' 2 indexes query 2 '
SELECT ta.id, tb.id, ta.vers, tb.vers
FROM temp ta
FULL JOIN temp tb on ta.id = tb.id and ta.vers = 'a' and tb.vers = 'b'
WHERE (ta.id is null or tb.id is null) and (ta.vers = 'a' or tb.vers = 'b')
;
print ' 2 indexes query 3 '
SELECT ta.id, TA.vers
from temp ta
where ta.vers = 'a'
and TA.id NOT IN(SELECT tb.id FROM temp tb WHERE tb.vers = 'b')
UNION ALL
SELECT tb.id, Tb.vers
from temp tb
where tb.vers = 'b'
and Tb.id NOT IN(SELECT ta.id FROM temp ta WHERE ta.vers = 'a')
SET STATISTICS TIME OFF;
Results
1 index query 1
(49898 row(s) affected)
SQL Server Execution Times:
CPU time = 156 ms, elapsed time = 3825 ms.
1 index query 2
(49898 row(s) affected)
SQL Server Execution Times:
CPU time = 281 ms, elapsed time = 2962 ms.
1 index query 3
(49898 row(s) affected)
SQL Server Execution Times:
CPU time = 422 ms, elapsed time = 2508 ms.
2 indexes query 1
(49898 row(s) affected)
SQL Server Execution Times:
CPU time = 172 ms, elapsed time = 2679 ms.
2 indexes query 2
(49898 row(s) affected)
SQL Server Execution Times:
CPU time = 406 ms, elapsed time = 3468 ms.
2 indexes query 3
(49898 row(s) affected)
SQL Server Execution Times:
CPU time = 407 ms, elapsed time = 3728 ms.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With