
SAS PROC SQL: How to quickly search if a variable contains a full substring?

Tags: sas, proc-sql

I have a problem like this at work:

Column Code holds values like 1000, 1200, A1000, B1200, AAA, BBB, etc. The codes within a value are separated by spaces, sometimes more than one because of poor data entry. I am trying to check whether a record contains a code that I am interested in.

Interested_Code: 1000 or A1000 or 444 or 555 or A555 etc.

I know a simple solution from this answer:

A.CODE LIKE CAT('% ', T3.Interested_Code, ' %')

I have appended a leading and trailing space to A.CODE to ensure that only full, exact matches are returned. If I simply do

A.CODE LIKE CAT('%', T3.Interested_Code, '%') or
A.CODE CONTAINS T3.Interested_Code

I get a false positive for code = 1000 on a row that contains code = A1000: this matches part of a code, which is not necessarily a correct result.

The code above works, but it performs too many tests and is really slow. Is there a faster or smarter way in PROC SQL? The main table has about 100k rows, and each row has around 10-20 codes. There are about 8k interested codes. Thanks.

George asked Mar 06 '23 07:03


1 Answer

You could use FINDW or INDEXW, which find "words" (by default, things separated by spaces or similar). That is probably better than your solution, in particular because the LIKE pattern as you have written it won't find

"1000 "

at the start of a value, since nothing there precedes it with a space.

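proc sql;
  /* Keep each row of codes that contains at least one interested code as a whole word */
  create table final_codes as
  select codes.*
  from codes
  where exists (
    select 1
    from interested_codes
    /* FINDW returns the position of the word, or 0 if it is absent */
    where findw(codes.code, trim(interested_codes.code)) > 0
  );
quit;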
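To see how whole-word matching differs from the other tests, a small check along these lines may help. It is only a sketch: the sample values, the dataset name findw_demo, and the variable names are made up for illustration, and the padded test uses INDEX as a rough stand-in for the LIKE pattern in the question.

```sas
/* Hypothetical sample values, invented for illustration only */
data findw_demo;
  length code $40;
  target = '1000';
  do code = '1000 AAA', 'A1000 BBB', 'AAA   1000', 'B1200 A1000';
    /* Plain substring test: also hits the 1000 inside A1000 */
    substring_match = (index(code, trim(target)) > 0);
    /* Space-padded test, approximating LIKE '% 1000 %' */
    padded_match = (index(code, ' ' || trim(target) || ' ') > 0);
    /* Whole-word test, with a blank as the only delimiter */
    word_match = (findw(code, trim(target), ' ') > 0);
    output;
  end;
run;

proc print data=findw_demo;
run;
```

With these values, the first row ('1000 AAA') should be flagged by the whole-word test but not by the padded test, because nothing precedes the code with a space, while the rows that contain only A1000 should be flagged by the substring test alone.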

However, the PROC SQL query is effectively a Cartesian join, and very slow. It has to consider all possible combinations, 8,000 times 100,000, or roughly 800 million of them, before it subsets down. It's just not going to be all that fast no matter what you do.

Doing this in a data step would be more efficient, in particular because you can more easily stop once you find a match. You can load the interested_codes table into a hash table or a temporary array. Depending on your match frequency, it may be faster to look up each code from the row in the interested_codes table, or the reverse; either way, stop when you find a match instead of testing all possible combinations.
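A minimal sketch of the hash-table variant might look like the following. It reuses the table and column names from the PROC SQL example (codes.code, interested_codes.code, final_codes); the variables word and i and the hash name wanted are hypothetical. It assumes the values in interested_codes.code have no leading blanks and that no single code inside codes.code is longer than that variable, since SCAN results are stored in a variable of the same length.

```sas
data final_codes;
  /* Define WORD with the same attributes as the lookup column (never executed) */
  if 0 then set interested_codes(keep=code rename=(code=word));

  /* Load the lookup table into memory once, on the first iteration */
  if _n_ = 1 then do;
    declare hash wanted(dataset: 'interested_codes(rename=(code=word))');
    wanted.defineKey('word');
    wanted.defineDone();
  end;

  set codes;

  /* Walk the space-separated codes in this row; repeated blanks count as one delimiter */
  do i = 1 to countw(code, ' ');
    word = scan(code, i, ' ');
    /* CHECK returns 0 when the key is found in the hash */
    if wanted.check() = 0 then do;
      output;   /* keep the row */
      leave;    /* stop at the first match */
    end;
  end;

  drop i word;
run;
```

Each row then costs at most 10-20 hash lookups rather than up to 8,000 FINDW calls, and the loop exits as soon as one interested code is found. Searching in the other direction (looping over the interested codes and calling FINDW on the row) would fit the temporary-array option instead.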

Joe answered Apr 29 '23 19:04