Fuzzy matching using T-SQL

Tags:

I have a table Persons with personaldata and so on. There are lots of columns but the once of interest here are: addressindex, lastname and firstname where addressindex is a unique address drilled down to the door of the apartment. So if I have 'like below' two persons with the lastname and one the firstnames are the same they are most likely duplicates.

I need a way to list these duplicates.

tabledata:  personid     1 firstname    "Carl" lastname     "Anderson" addressindex 1  personid     2 firstname    "Carl Peter" lastname     "Anderson" addressindex 1

I know how do this if I were to match exactly on all columns but I need fuzzy match to do the trick with (from the above example) a result like:

Row     personid      addressindex     lastname     firstname 1       2             1                Anderson     Carl Peter 2       1             1                Anderson     Carl .....

Any hints on how to solve this in a good way?

319

asked May 28 '09 16:05

Frederik

2 Answers

I've found that the stuff SQL Server gives you to do fuzzy matching is pretty clunky. I've had really good luck with my own CLR functions using the Levenshtein distance algorithm and some weighting. Using that algorithm, I've then made a UDF called GetSimilarityScore that takes two strings and returns a score between 0.0 and 1.0. The closer to 1.0 the match is, the better. Then, query with a threshold of >=0.8 or so to get the most likely matches. Something like this:

if object_id('tempdb..#similar') is not null drop table #similar select a.id, (     select top 1 x.id    from MyTable x    where x.id <> a.id    order by dbo.GetSimilarityScore(a.MyField, x.MyField) desc ) as MostSimilarId into #similar from MyTable a  select *, dbo.GetSimilarityScore(a.MyField, c.MyField) from MyTable a join #similar b on a.id = b.id join MyTable c on b.MostSimilarId = c.id

Just don't do it with really large tables. It's a slow process.

Here's the CLR UDFs:

''' <summary> ''' Compute the distance between two strings. ''' </summary> ''' <param name="s1">The first of the two strings.</param> ''' <param name="s2">The second of the two strings.</param> ''' <returns>The Levenshtein cost.</returns> <Microsoft.SqlServer.Server.SqlFunction()> _ Public Shared Function ComputeLevenstheinDistance(ByVal string1 As SqlString, ByVal string2 As SqlString) As SqlInt32     If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null     Dim s1 As String = string1.Value     Dim s2 As String = string2.Value      Dim n As Integer = s1.Length     Dim m As Integer = s2.Length     Dim d As Integer(,) = New Integer(n, m) {}      ' Step 1     If n = 0 Then Return m     If m = 0 Then Return n      ' Step 2     For i As Integer = 0 To n         d(i, 0) = i     Next      For j As Integer = 0 To m         d(0, j) = j     Next      ' Step 3     For i As Integer = 1 To n         'Step 4         For j As Integer = 1 To m             ' Step 5             Dim cost As Integer = If((s2(j - 1) = s1(i - 1)), 0, 1)              ' Step 6             d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)         Next     Next     ' Step 7     Return d(n, m) End Function  ''' <summary> ''' Returns a score between 0.0-1.0 indicating how closely two strings match.  1.0 is a 100% ''' T-SQL equality match, and the score goes down from there towards 0.0 for less similar strings. ''' </summary> <Microsoft.SqlServer.Server.SqlFunction()> _ Public Shared Function GetSimilarityScore(string1 As SqlString, string2 As SqlString) As SqlDouble     If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null      Dim s1 As String = string1.Value.ToUpper().TrimEnd(" "c)     Dim s2 As String = string2.Value.ToUpper().TrimEnd(" "c)     If s1 = s2 Then Return 1.0F ' At this point, T-SQL would consider them the same, so I will too      Dim flatLevScore As Double = InternalGetSimilarityScore(s1, s2)      Dim letterS1 As String = GetLetterSimilarityString(s1)     Dim letterS2 As String = GetLetterSimilarityString(s2)     Dim letterScore As Double = InternalGetSimilarityScore(letterS1, letterS2)      'Dim wordS1 As String = GetWordSimilarityString(s1)     'Dim wordS2 As String = GetWordSimilarityString(s2)     'Dim wordScore As Double = InternalGetSimilarityScore(wordS1, wordS2)      If flatLevScore = 1.0F AndAlso letterScore = 1.0F Then Return 1.0F     If flatLevScore = 0.0F AndAlso letterScore = 0.0F Then Return 0.0F      ' Return weighted result     Return (flatLevScore * 0.2F) + (letterScore * 0.8F) End Function  Private Shared Function InternalGetSimilarityScore(s1 As String, s2 As String) As Double     Dim dist As SqlInt32 = ComputeLevenstheinDistance(s1, s2)     Dim maxLen As Integer = If(s1.Length > s2.Length, s1.Length, s2.Length)     If maxLen = 0 Then Return 1.0F     Return 1.0F - Convert.ToDouble(dist.Value) / Convert.ToDouble(maxLen) End Function  ''' <summary> ''' Sorts all the alpha numeric characters in the string in alphabetical order ''' and removes everything else. ''' </summary> Private Shared Function GetLetterSimilarityString(s1 As String) As String     Dim allChars = If(s1, "").ToUpper().ToCharArray()     Array.Sort(allChars)     Dim result As New StringBuilder()     For Each ch As Char In allChars         If Char.IsLetterOrDigit(ch) Then             result.Append(ch)         End If     Next     Return result.ToString() End Function  ''' <summary> ''' Removes all non-alpha numeric characters and then sorts ''' the words in alphabetical order. ''' </summary> Private Shared Function GetWordSimilarityString(s1 As String) As String     Dim words As New List(Of String)()     Dim curWord As StringBuilder = Nothing     For Each ch As Char In If(s1, "").ToUpper()         If Char.IsLetterOrDigit(ch) Then             If curWord Is Nothing Then                 curWord = New StringBuilder()             End If             curWord.Append(ch)         Else             If curWord IsNot Nothing Then                 words.Add(curWord.ToString())                 curWord = Nothing             End If         End If     Next     If curWord IsNot Nothing Then         words.Add(curWord.ToString())     End If      words.Sort(StringComparer.OrdinalIgnoreCase)     Return String.Join(" ", words.ToArray()) End Function

158

answered Sep 23 '22 10:09

mattmc3

In addition to the other good info here, you might want to consider using the Double Metaphone phonetic algorithm which is generally considered to be better than SOUNDEX.

Tim Pfeiffer details an implementation in SQL in his article Double Metaphone Sounds Great Convert the C++ Double Metaphone algorithm to T-SQL (originally in SQL Mag & then in SQL Server Pro).

That will assist in matching names with slight misspellings, e.g., Carl vs. Karl.

Update: The actual downloadable code seems to be gone, but here's an implementation found on a github repo that appears to have cloned the original code

answered Sep 19 '22 10:09

D'Arcy Rittich

Related questions
                            
                                How to drop all stored procedures at once in SQL Server database?
                            
                                When is it better to store flags as a bitmask rather than using an associative table?
                            
                                Is it possible to use Full Text Search (FTS) with LINQ?
                            
                                What's the difference between a Primary Key and Identity?
                            
                                Read committed Snapshot VS Snapshot Isolation Level
                            
                                Should every SQL Server foreign key have a matching index?
                            
                                Create parameterized VIEW in SQL Server 2008
                            
                                When must we use NVARCHAR/NCHAR instead of VARCHAR/CHAR in SQL Server?
                            
                                How can I include null values in a MIN or MAX?
                            
                                How to generate Entity Relationship (ER) Diagram of a database using Microsoft SQL Server Management Studio?
                            
                                Preferred Method of Storing Passwords In Database
                            
                                Can't see the triggers that I created in SQL Server Management Studio 2008
                            
                                How to find column names for all tables in all databases in SQL Server
                            
                                SQL Server Invalid Column name after adding new column
                            
                                Comparing results with today's date?
                            
                                How to solve SQL Server Error 1222 i.e Unlock a SQL Server table
                            
                                How to update rows with a random date
                            
                                Use float or decimal for accounting application dollar amount?
                            
                                How do I find the data directory for a SQL Server instance?
                            
                                How can I convert ticks to a date format?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Fuzzy matching using T-SQL

Tags:

sql-server

tsql

fuzzy-search

Frederik

People also ask

2 Answers

mattmc3

D'Arcy Rittich

Recent Activity

Donate For Us