I am stuck on this one. I wish I could do it in pure SQL, but at this point any solution will do.
I have ta and tb tables, containing lists of events that occurred at approximately the same time. The goal is to find "orphan" records from ta on tb. E.g.:
create table ta ( dt date, id varchar(1));
insert into ta values( to_date('20130101 13:01:01', 'yyyymmdd hh24:mi:ss') , '1' );
insert into ta values( to_date('20130101 13:01:02', 'yyyymmdd hh24:mi:ss') , '2' );
insert into ta values( to_date('20130101 13:01:03', 'yyyymmdd hh24:mi:ss') , '3' );
create table tb ( dt date, id varchar(1));
insert into tb values( to_date('20130101 13:01:5', 'yyyymmdd hh24:mi:ss') , 'a' );
insert into tb values( to_date('20130101 13:01:6', 'yyyymmdd hh24:mi:ss') , 'b' );
Let's say I must use a threshold of ±5 seconds. The query to find the matches would then look something like:
select
    ta.id ida,
    tb.id idb
from
    ta, tb
where
    tb.dt between (ta.dt - 5/86400) and (ta.dt + 5/86400)
order by 1,2
(fiddle: http://sqlfiddle.com/#!4/b58f7c/5)
The rules are:
tb for a given one in ta will be considered the correct mapping.
That said, the resulting query should return something like
IDA | IDB
1 | a
2 | b
3 | null <-- orphan event
However, the sample query I've put here shows exactly the issue I am having: when the time windows overlap, it is difficult to systematically choose the correct row.
dense_rank() seems to be the answer to select the correct rows, but what partitioning/sorting will place them right?
Worth mentioning: I am doing this on Oracle 11gR2.
It seems like this should be possible with a single SQL statement using Oracle's analytic functions, perhaps with some combination of row_number(), lag(), and max() over. But I simply couldn't wrap my head around it. I kept on wanting to embed one analytic function within another, and I don't think you can do that. You can go in steps using Common Table Expressions, but I couldn't figure out how to make it work.
But a procedural solution is fairly straightforward using PL/SQL along with an extra table to store your result. I use row_number() to assign a chronological rank to each row in each of your source tables. You want a deterministic result, so it's important to have a tie breaker in case you have duplicate date-times, hence my order by of dt, id. Here is a SQL-Fiddle demo.
Or look at the code below:
create table result (
dif number,
ida varchar(1),
idb varchar(1),
dta date,
dtb date
);
declare
  -- Row numbers of the most recently matched row on each side.
  prevA integer := 0;
  prevB integer := 0;
begin
  for rec in (
    with
      ordered_ta as (
        select dt dta,
               id ida,
               row_number() over (order by dt, id) rowNumA
          from ta
      ),
      ordered_tb as (
        select dt dtb,
               id idb,
               row_number() over (order by dt, id) rowNumB
          from tb
      )
    -- Every (ta, tb) pair within +-5 seconds, in chronological order.
    select ta.*,
           tb.*,
           abs(dta - dtb) * 86400 dif
      from ordered_ta ta
      join ordered_tb tb
        on dtb between (dta - 5/86400) and (dta + 5/86400)
     order by rowNumA, rowNumB
  )
  loop
    -- Accept a pair only if both rows come after the last accepted pair,
    -- so each ta row and each tb row is matched at most once.
    if rec.rowNumA > prevA and rec.rowNumB > prevB then
      prevA := rec.rowNumA;
      prevB := rec.rowNumB;
      insert into result values (
        rec.dif,
        rec.ida,
        rec.idb,
        rec.dta,
        rec.dtb
      );
    end if;
  end loop;
end;
/
select * from result
union all
select null dif, id ida, null idb, dt dta, null dtb
from ta
where id not in (select ida from result)
union all
select null dif, null ida, id idb, null dta, dt dtb
from tb
where id not in (select idb from result)
;
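The question asked for just the IDA | IDB pairs, with a null IDB marking an orphan. As a minimal sketch, assuming the result table has already been populated by the PL/SQL block above, a left join from ta to result gives that shape directly:

```sql
-- Assumes the PL/SQL block above has already filled the result table.
-- Rows of ta that have no entry in result come back with a null idb,
-- which marks them as orphans.
select t.id  as ida,
       r.idb as idb
  from ta t
  left join result r
    on r.ida = t.id
 order by t.id;
```

With the sample data from the question this should list 1 with a, 2 with b, and 3 with a null idb. Orphans on the tb side are not shown by this query; the union all query above covers both sides.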