Finding strings that differ by at most one letter from a given string in SAS with PROC SQL

First some context. I am using PROC SQL in SAS, and need to fetch all the entries in a data set (with a couple of million entries) whose variable "Name" equals (let's say) "Massachusetts". Of course, since the data was once entered manually by humans, almost every conceivable spelling error occurs ("Amssachusetts", "Kassachusetts" etc.).

I have found that few entries get more than two characters wrong, so the code

Name like "__ssachusetts" OR Name like "_a_sachusetts" OR ... OR Name like "Massachuset__"

would select the entries I am looking for. However, I am hoping there is a more convenient way to write something like

Name that differs by at most 2 characters from "Massachusetts";

Is there? Or is there some other strategy for fetching these entries? I tried searching both Stack Overflow and the web but was unsuccessful. I am also a relative beginner with both SQL and SAS.

Some additional information: The database is not in English (and the actual string is not "Massachusetts") so using SOUNDEX is not really feasible (if it ever were).

Thanks in advance.

(Edit: Improved the title)

asked Apr 26 '11 13:04

Har

4 Answers

SAS has the built-in functions COMPGED and COMPLEV, which compute the generalized edit distance and the Levenshtein edit distance between two strings, respectively. Here is an example that selects just those names with a Levenshtein edit distance of at most 2.

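/* Sample data: the correct spelling plus a few misspellings */
data typo;
input name $20.;
datalines;
massachusetts
masachusets
mssachusetts
nassachusets
nassachussets
massachusett
;

/* Keep only names within 2 edits (insert, delete, replace) of the target */
proc sql;
  select name from typo
  where complev(name, "massachusetts") <= 2;
quit;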
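
As a variation on the query above, COMPLEV also accepts an optional cutoff and a modifier string. The sketch below assumes the same typo data set. The cutoff of 3, the 'il' modifiers (ignore case, remove leading blanks) and the column alias lev_dist are illustrative choices, not part of the original answer.

```sas
proc sql;
  /* Assumption: cutoff = 3. Distances above the cutoff are returned as the
     cutoff itself, so any cutoff greater than 2 leaves the filter unchanged. */
  /* Assumption: 'il' = ignore case and remove leading blanks before comparing */
  select name,
         complev(name, "Massachusetts", 3, 'il') as lev_dist
  from typo
  where calculated lev_dist <= 2
  order by lev_dist;
quit;
```

Ordering by the distance puts exact matches first, which can help when reviewing which misspellings were picked up. COMPGED could be used in the same position, but its distances are cost-weighted, so the threshold would need to be chosen on a different scale.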

answered Oct 05 '22 21:10

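-- Oracle PL/SQL: wrap the comparison in a deterministic stored function
CREATE FUNCTION distance(one VARCHAR2, two VARCHAR2) RETURN NUMBER
DETERMINISTIC
IS
BEGIN
  -- do some comparison here; as one option, Oracle's built-in
  -- UTL_MATCH package provides a Levenshtein edit distance
  RETURN UTL_MATCH.EDIT_DISTANCE(one, two);
END distance;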

And then use it in SQL:

SELECT * FROM my_table WHERE distance(name, 'Massachusetts') <= 2;

Of course, this tends to be quite slow: the function has to be evaluated for every row, so the query cannot use an ordinary index on name.
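
On the speed concern, in the SAS setting of the question one possible mitigation is to score each distinct spelling once and then join back to the full data. This is only a sketch: it reuses the typo data set and name column from the earlier answer, the table name close_names is hypothetical, and whether it helps depends on how many distinct spellings the data contains.

```sas
proc sql;
  /* Step 1: compute the distance once per distinct spelling.
     close_names is a hypothetical work table introduced for this sketch. */
  create table close_names as
  select d.name
  from (select distinct name from typo) as d
  where complev(d.name, "massachusetts", 3) <= 2;

  /* Step 2: fetch every row whose spelling made the shortlist */
  select t.*
  from typo as t
       inner join close_names as c
       on t.name = c.name;
quit;
```

With a couple of million rows but far fewer distinct spellings, the edit distance would be computed far fewer times, and the join in step 2 is a plain equality match.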

answered Oct 05 '22 22:10

Lukas Eder

Tags: regex, sql, sas

Answered by: cmjohns, shrutyzet, Draco Ater, Lukas Eder
