I have a problem like this at work:
Column Code has values like 1000, 1200, A1000, B1200, AAA, BBB, etc. The codes are currently separated by spaces, sometimes more than one because of poor data input. I am trying to check whether a record contains a code that I am interested in.
Interested_Code: 1000 or A1000 or 444 or 555 or A555, etc.
I know a simple solution from this answer:
A.CODE LIKE CAT('% ', T3.Interested_Code, ' %')
I have padded A.CODE with a leading and a trailing space to ensure that only "full" exact matches are returned. If I simply do
A.CODE LIKE CAT('%', T3.Interested_Code, '%') or
A.CODE CONTAINS T3.Interested_Code
I get a false positive for code = 1000 on a row that contains code = A1000: this matches part of a code, which is not necessarily a correct result.
My code above works, but it performs too many tests and is really slow. Is there a faster or smarter way in PROC SQL? The main table has about 100k rows, and each row has around 10-20 codes. There are about 8k codes of interest. Thanks.
You could use FINDW or INDEXW, which find "words" (by default, things separated by spaces or similar delimiters). That is probably better than your solution, in particular because with your pattern as written you won't match a code at the very start of the value, such as "1000 ", since it isn't preceded by a space.
proc sql;
    create table final_codes as
    select codes.*
    from codes
    /* keep a row if any interested code appears in it as a whole word */
    where exists (
        select 1
        from interested_codes
        where findw(codes.code, trim(interested_codes.code)) > 0
    );
quit;
However, this is effectively a Cartesian join, and very slow. It has to consider all possible combinations: 8,000 times 100,000, or 800 million temporary rows, before it subsets down. It's just not going to be all that fast no matter what you do.
Doing this in a data step would be more efficient, in particular as you can more easily stop once you find a match. You can put the interested_codes table into a hash table or a temporary array, and then depending on your match frequency it may be faster to search each code in the interested_codes table, or the reverse, but either way stop when you find a match (instead of doing all possible combinations).
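A minimal sketch of the hash-table variant, assuming the same table and column names as the PROC SQL above (codes.code and interested_codes.code). The names interested_code, i and found are hypothetical, introduced only for this illustration. It splits each row into words and looks each one up in the hash, stopping at the first hit:

```sas
data final_codes;
    /* Define the key variable in the PDV without reading any rows. */
    if 0 then set interested_codes(keep=code rename=(code=interested_code));

    /* Load the list of interested codes into memory once. */
    if _n_ = 1 then do;
        declare hash h(dataset: "interested_codes(keep=code rename=(code=interested_code))");
        h.defineKey("interested_code");
        h.defineDone();
    end;

    set codes;

    /* Walk the space-separated words; consecutive spaces count as one delimiter. */
    found = 0;
    do i = 1 to countw(code, ' ') while (not found);
        interested_code = scan(code, i, ' ');
        if h.check() = 0 then found = 1;   /* check() returns 0 when the key exists */
    end;

    if found;   /* keep only rows with at least one matching code */
    drop i found interested_code;
run;
```

With about 100k rows of 10-20 codes each, this is on the order of 1-2 million hash lookups rather than 800 million comparisons. Note that the lookup is case-sensitive, and a word longer than the length of interested_codes.code would be truncated before the lookup, so this assumes the lookup column is at least as wide as the longest code.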