
Finding strings that differ by at most one letter from a given string in SAS with PROC SQL

Tags: regex, sql, sas

First some context. I am using PROC SQL in SAS and need to fetch all the entries in a data set (with a couple of million entries) that have the variable "Name" equal to (let's say) "Massachusetts". Of course, since the data was once entered manually by humans, close to all conceivable spelling errors occur ("Amssachusetts", "Kassachusetts", etc.).

I have found that few entries get more than two characters wrong, so the code

Name like "__ssachusetts" OR Name like "_a_sachusetts" OR ... OR Name like "Massachuset__"

would select the entries I am looking for. However, I am hoping that there must be a more convenient way to write

Name that differs by at most 2 characters from "Massachusetts";

Is there? Or is there some other strategy for fetching these entries? I tried searching both Stack Overflow and the web but was unsuccessful. I am also a relative beginner with both SQL and SAS.

Some additional information: The database is not in English (and the actual string is not "Massachusetts") so using SOUNDEX is not really feasible (if it ever were).

Thanks in advance.

(Edit: Improved the title)

asked Apr 26 '11 13:04 by Har



4 Answers

SAS has the built-in functions COMPGED (generalized edit distance) and COMPLEV (Levenshtein edit distance) to compute distances between strings. Here is an example that shows how to select just the names with a Levenshtein edit distance of 2 or less from the target.

data typo;
input name $20.;
datalines;
massachusetts
masachusets
mssachusetts
nassachusets
nassachussets
massachusett
;

proc sql;
  select name from typo
  where complev(name, "massachusetts") <= 2;
quit;
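
For readers unfamiliar with the metric: the Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The following is a minimal Python sketch of that definition, not SAS's actual COMPLEV implementation; the function name `levenshtein` and the `matches` list are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (row-by-row dynamic programming)."""
    # previous[j] is the distance between the already processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


# Same sample names as in the SAS data step above.
names = ["massachusetts", "masachusets", "mssachusetts",
         "nassachusets", "nassachussets", "massachusett"]
matches = [n for n in names if levenshtein(n, "massachusetts") <= 2]
```

With this sketch, every sample name except "nassachussets" (distance 3) ends up in `matches`.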

answered Oct 05 '22 21:10 by cmjohns


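There are non-phonetic string metrics such as Hamming distance that should work better than SOUNDEX here. Hamming distance counts the positions at which two strings of equal length differ, which is what the LIKE patterns in the question test for. You can search Google for an implementation of this algorithm for your specific DB engine.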
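
To make the idea concrete, here is a minimal Python sketch of a Hamming-distance check. The names `hamming` and `within_hamming` are illustrative and do not come from SAS or any database library.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is only defined for equal-length strings")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def within_hamming(name: str, target: str, limit: int = 2) -> bool:
    """True if name has the same length as target and differs from it
    in at most `limit` positions."""
    return len(name) == len(target) and hamming(name, target) <= limit
```

For example, `within_hamming("Amssachusetts", "Massachusetts")` is true because only the first two positions differ. Note that a dropped or extra letter changes the length, so such typos are rejected; catching them needs an edit distance such as Levenshtein.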

answered Oct 05 '22 22:10 by shrutyzet


What you are looking for is approximate string matching. One way to do it is to compute the Levenshtein distance between the two strings. I am not sure, but I hope this answer helps.

answered Oct 05 '22 20:10 by Draco Ater


You could implement a stored function of this type (Oracle syntax, transform to your RDBMS):

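CREATE FUNCTION distance(one VARCHAR2, two VARCHAR2) RETURN NUMBER
DETERMINISTIC IS
BEGIN
  -- Do the comparison here. One option in Oracle is the built-in
  -- UTL_MATCH package, which provides a Levenshtein edit distance:
  RETURN UTL_MATCH.EDIT_DISTANCE(one, two);
END distance;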

And then use it in SQL:

SELECT * FROM my_table WHERE distance(name, 'Massachusetts') <= 2;

Of course, these things tend to be quite slow...
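
One common way to reduce that cost is to reject candidates cheaply before (or while) computing the full distance. The Python sketch below is only an illustration of that idea, under the assumption that the comparison is a Levenshtein distance with a fixed threshold k; `within_distance` is a hypothetical helper, not part of any database.

```python
def within_distance(a: str, b: str, k: int) -> bool:
    """True if the Levenshtein distance between a and b is at most k."""
    # Cheap rejection: each character of length difference costs at least one edit.
    if abs(len(a) - len(b)) > k:
        return False
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # deletion
                current[j - 1] + 1,                # insertion
                previous[j - 1] + (ch_a != ch_b),  # substitution or match
            ))
        # The smallest value in a row never decreases from one row to the next,
        # so once every entry exceeds k the final distance must exceed k too.
        if min(current) > k:
            return False
        previous = current
    return previous[-1] <= k
```

With k = 2 and a 13-character target, the length test alone discards every name shorter than 11 or longer than 15 characters without running the inner loop.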

answered Oct 05 '22 22:10 by Lukas Eder