We have a SQL Server table containing Company Name, Address, and Contact name (among others).
We regularly receive data files from outside sources that require us to match up against this table. Unfortunately, the data is slightly different since it is coming from a completely different system. For example, we have "123 E. Main St." and we receive "123 East Main Street". Another example, we have "Acme, LLC" and the file contains "Acme Inc.". Another is, we have "Ed Smith" and they have "Edward Smith"
We have a legacy system that utilizes some rather intricate and CPU intensive methods for handling these matches. Some involve pure SQL and others involve VBA code in an Access database. The current system is good but not perfect and is cumbersome and difficult to maintain
The management here wants to expand its use. The developers who will inherit the support of the system want to replace it with a more agile solution that requires less maintenance.
Is there a commonly accepted way for dealing with this kind of data matching?
Data Comparison means The procedure of checking, undertaken by the Bank or the Counterpart Bank to determine whether the Established Baseline of a particular Trade Transaction matches with the Data Set of the Trade Transaction by way of submitting to TSU the Data Set of the Trade Transaction based upon the Shipping ...
dbForge Schema Compare for SQL Server is a robust database comparison tool that saves you time and effort when catching the differences between SQL Server databases. The tool helps quickly generate SQL merge script to resolve differences.
Comparisons transform raw data from something abstract to something relatable. They enable us to judge the value and relevance of a number. Data gives information about the world, but it needs a comparison framework to be useful.
Here's something I wrote for a nearly identical stack (we needed to standardize the manufacturer names for hardware and there were all sorts of variations). This is client side though (VB.Net to be exact) -- and use the Levenshtein distance algorithm (modified for better results):
Public Shared Function FindMostSimilarString(ByVal toFind As String, ByVal ParamArray stringList() As String) As String
Dim bestMatch As String = ""
Dim bestDistance As Integer = 1000 'Almost anything should be better than that!
For Each matchCandidate As String In stringList
Dim candidateDistance As Integer = LevenshteinDistance(toFind, matchCandidate)
If candidateDistance < bestDistance Then
bestMatch = matchCandidate
bestDistance = candidateDistance
End If
Next
Return bestMatch
End Function
'This will be used to determine how similar strings are. Modified from the link below...
'Fxn from: http://ca0v.terapad.com/index.cfm?fa=contentNews.newsDetails&newsID=37030&from=list
Public Shared Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
Dim sLength As Integer = s.Length ' length of s
Dim tLength As Integer = t.Length ' length of t
Dim lvCost As Integer ' cost
Dim lvDistance As Integer = 0
Dim zeroCostCount As Integer = 0
Try
' Step 1
If tLength = 0 Then
Return sLength
ElseIf sLength = 0 Then
Return tLength
End If
Dim lvMatrixSize As Integer = (1 + sLength) * (1 + tLength)
Dim poBuffer() As Integer = New Integer(0 To lvMatrixSize - 1) {}
' fill first row
For lvIndex As Integer = 0 To sLength
poBuffer(lvIndex) = lvIndex
Next
'fill first column
For lvIndex As Integer = 1 To tLength
poBuffer(lvIndex * (sLength + 1)) = lvIndex
Next
For lvRowIndex As Integer = 0 To sLength - 1
Dim s_i As Char = s(lvRowIndex)
For lvColIndex As Integer = 0 To tLength - 1
If s_i = t(lvColIndex) Then
lvCost = 0
zeroCostCount += 1
Else
lvCost = 1
End If
' Step 6
Dim lvTopLeftIndex As Integer = lvColIndex * (sLength + 1) + lvRowIndex
Dim lvTopLeft As Integer = poBuffer(lvTopLeftIndex)
Dim lvTop As Integer = poBuffer(lvTopLeftIndex + 1)
Dim lvLeft As Integer = poBuffer(lvTopLeftIndex + (sLength + 1))
lvDistance = Math.Min(lvTopLeft + lvCost, Math.Min(lvLeft, lvTop) + 1)
poBuffer(lvTopLeftIndex + sLength + 2) = lvDistance
Next
Next
Catch ex As ThreadAbortException
Err.Clear()
Catch ex As Exception
WriteDebugMessage(Application.StartupPath , [Assembly].GetExecutingAssembly().GetName.Name.ToString, MethodBase.GetCurrentMethod.Name, Err)
End Try
Return lvDistance - zeroCostCount
End Function
SSIS (in Sql 2005+ Enterprise) has Fuzzy Lookup which is designed for just such data cleansing issues.
Other than that, I only know of domain specific solutions - such as address cleaning, or general string matching techniques.
There are many vendors out there that offer products to do this kind of pattern matching. I would do some research and find a good, well-reputed product and scrap the home-grown system.
As you say, your product is only good, and this is a common-enough need for businesses that I'm sure there's more than one excellent product out there. Even if it costs a few thousand bucks for a license, it will still be cheaper than paying a bunch of developers to work on something in-house.
Also, the fact that the phrases "intricate", "CPU intensive", "VBA code" and "Access database" appear together in your system's description is another reason to find a good third-party tool.
EDIT: it's also possible that .NET has a built-in component that does this kind of thing, in which case you wouldn't have to pay for it. I still get surprised once in a while by the tools that .NET offers.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With