First some context. I am using proc sql in SAS, and need to fetch all the entries in a data set (with a couple of million entries) that have variable "Name" equal to (let's say) "Massachusetts". Of course, since the data was once manually entered by humans, close to all conceivable spelling errors occur ("Amssachusetts", "Kassachusetts" etc.).
I have found that few entries get more than two characters wrong, so the code
Name like "__ssachusetts" OR Name like "_a_sachusetts" OR ... OR Name like "Massachuset__"
would select the entries I am looking for. However, I am hoping that there must be a more convenient way to write
Name that differs by at most 2 characters from "Massachusetts";
Is there? Or is there some other strategy for fetching these entries? I tried searching both Stack Overflow and the web but was unsuccessful. I am also a relative beginner with both SQL and SAS.
Some additional information: The database is not in English (and the actual string is not "Massachusetts") so using SOUNDEX is not really feasible (if it ever were).
Thanks in advance.
(Edit: Improved the title)
You can use the FIND function in SAS to find the position of the first occurrence of some substring within a string.
data new_data;
  set original_data;
  first_occurrence = find(variable_name, "string", "i");
run;
The "i" argument tells SAS to ignore case when searching for the substring.
In this case, MAX is a SAS function. It works with the WHERE clause because you are comparing the values of two columns within the same row. Consequently, it can be used to subset the data.
The GROUP BY clause groups data by a specified column or columns. When you use a GROUP BY clause, you also use an aggregate function in the SELECT clause or in a HAVING clause to instruct PROC SQL in how to summarize the data for each group. PROC SQL calculates the aggregate function separately for each group.
The FIND function searches string for the first occurrence of the specified substring, and returns the position of that substring. If the substring is not found in string, FIND returns a value of 0.
SAS has built-in functions COMPGED and COMPLEV to compute distances between strings. Here is an example that shows how to select just those with a Levenshtein edit distance of less than or equal to 2.
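/* Sample data: the correct spelling plus several misspellings */
data typo;
  input name $20.;
  datalines;
massachusetts
masachusets
mssachusetts
nassachusets
nassachussets
massachusett
;
run;

/* Keep only the rows within Levenshtein distance 2 of the target */
proc sql;
  select name from typo
  where complev(name, "massachusetts") <= 2;
quit;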
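Since a phonetic algorithm such as SOUNDEX does not fit here, a distance-based measure such as Hamming distance (the number of positions at which two strings of equal length differ) should work better. You can search the web for an implementation of this algorithm for your specific DB engine.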
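To make the idea concrete, here is a minimal Python sketch of Hamming distance, not an implementation for any particular DB engine. The function names `hamming` and `within_hamming` are made up for illustration. Note that this measure matches the underscore patterns in the question (same length, up to two wrong characters), so it will not catch names with a dropped or extra letter.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def within_hamming(name: str, target: str, max_diff: int = 2) -> bool:
    """True if name has the same length as target and differs in at most max_diff positions."""
    return len(name) == len(target) and hamming(name, target) <= max_diff


# Hypothetical sample values, taken from the misspellings mentioned in the question
for candidate in ["Massachusetts", "Amssachusetts", "Kassachusetts", "Masachusets"]:
    print(candidate, within_hamming(candidate, "Massachusetts"))
```

A name like "Masachusets" is rejected here only because its length differs; an edit distance that allows insertions and deletions would be needed to catch it.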
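What you are looking for is called "approximate string matching". One way to do it is to compute the Levenshtein distance: the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into the other. I am not sure it fits your case exactly, but I hope this helps.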
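As a rough illustration of how such a distance can be computed, here is a small Python sketch using the standard dynamic-programming approach. The names `levenshtein` and `close_matches`, the sample list, and the choice to ignore case are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def close_matches(names, target, max_distance=2):
    """Keep the names within max_distance edits of target, ignoring case."""
    target = target.lower()
    return [n for n in names if levenshtein(n.lower(), target) <= max_distance]


# Hypothetical sample data based on the misspellings in the question
names = ["Massachusetts", "Amssachusetts", "Kassachusetts",
         "Masachusets", "Nassachussets", "Texas"]
print(close_matches(names, "Massachusetts"))
```

Each comparison costs time proportional to the product of the two string lengths, so on a couple of million rows it helps to pre-filter, for example by string length, before computing the distance.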
You could implement a stored function of this type (Oracle syntax, transform to your RDBMS):
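CREATE FUNCTION distance(one VARCHAR2, two VARCHAR2) RETURN NUMBER
DETERMINISTIC
IS
BEGIN
  -- do some comparison here; one option in Oracle is the built-in
  -- Levenshtein distance from the UTL_MATCH package
  RETURN UTL_MATCH.EDIT_DISTANCE(one, two);
END distance;
/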
And then use it in SQL:
SELECT * FROM my_table WHERE distance(name, 'Massachusetts') <= 2;
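Of course, these things tend to be quite slow, since the function has to be evaluated for every row and an ordinary index on the column cannot be used.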