I'm trying to rewrite the below query to replace the 'IN' clause with inner join
select * from employee_rec er
inner join ed_claim_recd ed on er.ssn=ed.insssn and substr(er.group_rec_key,1,10) = substr(er.group_rec_key,1,10)
and ed.claim in (select claimno from cd_claim_recd cd where cd.closedt is not null and cd.closedt != '0000000' and cd.closedt >= '2130101')
and ed.insssn in (select er1.ssn from employee_rec er1 where er1.status != 'ACTIV' and trim(ER1.CLAIMNO) is null)
and er.sysind not in ('ABC,'BCD')
Below is what I could come up with, but the results are not same as the previous query
select * from employee_rec er
inner join ed_claim_recd ed on er.ssn = ed.insssn and substr(er.group_rec_key, 1, 10) = substr(er.group_rec_key, 1, 10)
inner join (select claimno from cd_claim_recd cd where cd.closedt is not null and cd.closedt != '0000000' and cd.closedt >= '2130101') cr on ed.claim = cr.claimno
inner join (
select insssn from ed_claim_recd ed2
inner join (
select ssn from employee_rec er1
where
er1.status != 'ACTIV'
and trim(ER1.CLAIMNO) is null
) er2 on ed2.insssn = er2.ssn
) ed3 on ed.insssn = ed3.insssn
and er.sysind not in ('ABC', 'BCD')
Is this the right approach to rewrite the query or am I way too off? Also, is it an efficient way to rewrite the query to replace "IN' with 'INNER JOIN'?
IN subquery and INNER JOIN work differently. Join will output for each join key from one table all rows with the same key from joined table. So, Join can duplicate rows if a join key in the joined table is not unique. IN subquery will not duplicate rows.
For example if in your cr join subquery
inner join (select claimno from cd_claim_recd cd where cd.closedt is not null and cd.closedt != '0000000' and cd.closedt >= '2130101') cr on ed.claim = cr.claimno
the claimno is not unique, then joined rows with matching claimno will be duplicated. This is quite normal Join behavior.
To avoid such duplication, make sure that join key is unique by adding DISTINCT, row_number() filter, group by, etc:
inner join (select DISTINCT claimno from cd_claim_recd cd where cd.closedt is not null and cd.closedt != '0000000' and cd.closedt >= '2130101') cr on ed.claim = cr.claimno
And the same for other such joins.
In this case the result of IN and Join should be the same.
BTW you do not need all these conditions together:
where cd.closedt is not null and cd.closedt != '0000000' and cd.closedt >= '2130101'
because '2130101' is greater than '0000000' and if cd.closedt >= '2130101' it cannot be NULL. cd.closedt >= '2130101' is quite enough.
Have found one more possible problem:
and trim(ER1.CLAIMNO) is null
In Hive (and you have tagged your question with @hive tag) empty string and null are two different things.
('' is not NULL) = true in Hive.
I'd suggest to replace it with and (ER.CLAIMNO is null or trim(ER1.CLAIMNO)='')
Empty string is a normal value in Hive, that is why empty strings participate in joins. Convert to NULL or filter them before join if you do not need them to be joined.
ed3 subquery contains redundant join, it is not like original IN subquery.
Maybe there are other issues. Test all joins one by one to find all of them
I make it this (untested):
select er.*, ed.*
from employee_rec er
join ed_claim_recd ed
on ed.insssn = er.ssn
and substr(ed.group_rec_key,1,10) /* was er.group_rec_key */ = substr(er.group_rec_key,1,10)
join cd_claim_recd cd
on cd.claimno = ed.claim
join employee_rec er1
on er1.ssn = ed.insssn
where er.sysind not in ('ABC', 'BCD')
-- and cd.closedt is not null -- redundant
-- and cd.closedt != '0000000' -- redundant
and cd.closedt >= '2130101'
and er1.status != 'ACTIV'
and trim(er1.claimno) is null
If cd_claim_recd.claimno and employee_rec.ssn and not unique keys then you may need some dedupe logic.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With