Lets say I have 9 rows of records. Each 3 rows have the same value. For instance:
Mike
Mike
Mike
John
John
John
Ryan
Ryan
Ryan
Is there a way I can search for similarities of these records? For example spelling mistakes, additional characters, missing characters, etc. So, for example, the correct version is Mike
, but there might be a record down in the list with value Mke
which is incorrect (spelling mistake). How can I find this and replace it with the correct one?
The above example is obviously simplified. I actually have ~1mln rows. Right now to achieve the 'grouping' of the elements I just sort them alphabetically.
I was facing the exact same problem! With a few searches I could get and modify the following VBA code that will enable a function named =Similarity()
. This function will output a number that goes from 0 to 1, according to the similarity of the two input cells.
I ordered alphabetically my column info and applied the formula. Then I created a Conditional Formatting Rule
to highlight the ones with a high similarity rate (i.e.: at least 65%). Then I searched for each highlighted occurrence and fixed my records manually.
Usage:
=Similarity(cell1, cell2)
Obs.: The similarity indicator goes from 0 to 1 (0% to 100%)
To use it, you must:
Code:
Public Function Similarity(ByVal String1 As String, _
ByVal String2 As String, _
Optional ByRef RetMatch As String, _
Optional min_match = 1) As Single
Dim b1() As Byte, b2() As Byte
Dim lngLen1 As Long, lngLen2 As Long
Dim lngResult As Long
If UCase(String1) = UCase(String2) Then
Similarity = 1
Else:
lngLen1 = Len(String1)
lngLen2 = Len(String2)
If (lngLen1 = 0) Or (lngLen2 = 0) Then
Similarity = 0
Else:
b1() = StrConv(UCase(String1), vbFromUnicode)
b2() = StrConv(UCase(String2), vbFromUnicode)
lngResult = Similarity_sub(0, lngLen1 - 1, _
0, lngLen2 - 1, _
b1, b2, _
String1, _
RetMatch, _
min_match)
Erase b1
Erase b2
If lngLen1 >= lngLen2 Then
Similarity = lngResult / lngLen1
Else
Similarity = lngResult / lngLen2
End If
End If
End If
End Function
Private Function Similarity_sub(ByVal start1 As Long, ByVal end1 As Long, _
ByVal start2 As Long, ByVal end2 As Long, _
ByRef b1() As Byte, ByRef b2() As Byte, _
ByVal FirstString As String, _
ByRef RetMatch As String, _
ByVal min_match As Long, _
Optional recur_level As Integer = 0) As Long
'* CALLED BY: Similarity *(RECURSIVE)
Dim lngCurr1 As Long, lngCurr2 As Long
Dim lngMatchAt1 As Long, lngMatchAt2 As Long
Dim I As Long
Dim lngLongestMatch As Long, lngLocalLongestMatch As Long
Dim strRetMatch1 As String, strRetMatch2 As String
If (start1 > end1) Or (start1 < 0) Or (end1 - start1 + 1 < min_match) _
Or (start2 > end2) Or (start2 < 0) Or (end2 - start2 + 1 < min_match) Then
Exit Function '(exit if start/end is out of string, or length is too short)
End If
For lngCurr1 = start1 To end1
For lngCurr2 = start2 To end2
I = 0
Do Until b1(lngCurr1 + I) <> b2(lngCurr2 + I)
I = I + 1
If I > lngLongestMatch Then
lngMatchAt1 = lngCurr1
lngMatchAt2 = lngCurr2
lngLongestMatch = I
End If
If (lngCurr1 + I) > end1 Or (lngCurr2 + I) > end2 Then Exit Do
Loop
Next lngCurr2
Next lngCurr1
If lngLongestMatch < min_match Then Exit Function
lngLocalLongestMatch = lngLongestMatch
RetMatch = ""
lngLongestMatch = lngLongestMatch _
+ Similarity_sub(start1, lngMatchAt1 - 1, _
start2, lngMatchAt2 - 1, _
b1, b2, _
FirstString, _
strRetMatch1, _
min_match, _
recur_level + 1)
If strRetMatch1 <> "" Then
RetMatch = RetMatch & strRetMatch1 & "*"
Else
RetMatch = RetMatch & IIf(recur_level = 0 _
And lngLocalLongestMatch > 0 _
And (lngMatchAt1 > 1 Or lngMatchAt2 > 1) _
, "*", "")
End If
RetMatch = RetMatch & Mid$(FirstString, lngMatchAt1 + 1, lngLocalLongestMatch)
lngLongestMatch = lngLongestMatch _
+ Similarity_sub(lngMatchAt1 + lngLocalLongestMatch, end1, _
lngMatchAt2 + lngLocalLongestMatch, end2, _
b1, b2, _
FirstString, _
strRetMatch2, _
min_match, _
recur_level + 1)
If strRetMatch2 <> "" Then
RetMatch = RetMatch & "*" & strRetMatch2
Else
RetMatch = RetMatch & IIf(recur_level = 0 _
And lngLocalLongestMatch > 0 _
And ((lngMatchAt1 + lngLocalLongestMatch < end1) _
Or (lngMatchAt2 + lngLocalLongestMatch < end2)) _
, "*", "")
End If
Similarity_sub = lngLongestMatch
End Function
Text similarity can get quite complicated, depending on just how far you want to go. A complete survey of all the different algorithms can be found in this paper A Survey of Text Similarity Approaches (Gomaa & Fahmy, IJCA 2013). It can hurt your head, but it's good stuff.
Specific to VBA you can refer to this previous answer on SO
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With