Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What are ways to match street addresses in SQL Server?

Tags:

sql

sql-server

We have a column for street addresses:

123 Maple Rd.
321 1st Ave.
etc...

Is there any way to match these addresses to a given input? The input would be a street address, but it might not be in the same format. For example:

123 Maple Road
321 1st Avenue

Our first thought is to strip the input of all street terms (rd, st, ave, blvd, etc).

Obviously that won't match reliably all the time. Are there any other ways to try to match street addresses in SQL server?

We can use user defined functions, stored procs and regular old t-sql. We cannot use clr.

like image 951
dtc Avatar asked Jan 19 '10 21:01

dtc


2 Answers

Rather than stripping out the things that can be variable, try to convert them to a "canonical form" that can be compared.

For example, replace 'rd' or 'rd.' with 'road' and 'st' or 'st.' with 'street' before comparing.

like image 74
Mark Byers Avatar answered Oct 13 '22 12:10

Mark Byers


You may want to consider using the Levenshtein Distance algorithm.

You can create it as a user-defined function in SQL Server, where it will return the number of operations that need to be performed on String_A so that it becomes String_B. You can then compare the result of the Levenshtein Distance function against some fixed threshold, or against some value derived from the length of the strings.

You would simply use it as follows:

... WHERE LEVENSHTEIN(address_in_db, address_to_search) < 5;

As Mark Byers suggested, converting variable terms into canonical form will help if you use Levenshtein Distance.

Using Full-Text Search may be another option, especially since Levenshtein would normally require a full table scan. This decision may depend on how frequently you intend to do these queries.

You may want to check out the following Levenshtein Distance implementation for SQL Server:

  • Levenshtein Distance Algorithm: TSQL Implementation

Note: You would need to implement a MIN3 function for the above implementation. You can use the following:

CREATE FUNCTION MIN3(@a int, @b int,  @c int)
RETURNS int
AS
BEGIN
    DECLARE @m INT
    SET @m = @a

    IF @b < @m SET @m = @b
    IF @c < @m SET @m = @c

    RETURN @m
END

You may also be interested in checking out the following articles:

  • Address Geocoding with Fuzzy String Matching [Uses Levenshtein Distance]
  • Stack Overflow - Strategies for finding duplicate mailing addresses
  • Merge/Purge and Duplicate Detection
like image 32
Daniel Vassallo Avatar answered Oct 13 '22 11:10

Daniel Vassallo