Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Weighted Damerau-Levenshtein in VBA

I'm building a private spellchecker for the Microsoft Office suite. I'm doing string comparisons of typos and their potential fixes to determine which corrections I want included.

I've looked high and low for a weighted Damerau-Levenshtein formula for string comparison because I want swaps, insertions, deletions and replacements to all have different weights, not simply a weight of "1", so I can give preference to some corrections over others. For example, the typo "agmes" could theoretically correct to "games" or "ages", since both require just one edit to move to either correctly spelled word, but I'd like to give the "swap" edit a lower weight so that "games" would show as the preferred correction.

I'm using Excel for analysis, so any code I use needs to be in Visual Basic for Applications (VBA). The best I could find is this example, which seems great, but it's in Java. I tried my best to convert, but I'm far from an expert and could use a little help!

Can anyone take a look at the attached code and help me figure out what's wrong?

THANK YOU!

EDIT: I got it working on my own. Here's a weighted Damerau-Levenshtein formula in VBA. It uses Excel's built-in math functions for some evaluation. When comparing a typo to two possible corrections, the correction with the highest cost is the preferred word. This is because the cost of two swaps must be greater than the cost of a deletion and an insertion, and that's not possible if you assign swaps with the lowest cost (which I think is ideal). Check out Kevin's blog if you need more info.

Public Function WeightedDL(source As String, target As String) As Double

    Dim deleteCost As Double
    Dim insertCost As Double
    Dim replaceCost As Double
    Dim swapCost As Double

    deleteCost = 1
    insertCost = 1.1
    replaceCost = 1.1
    swapCost = 1.2

    Dim i As Integer
    Dim j As Integer
    Dim k As Integer

    If Len(source) = 0 Then
        WeightedDL = Len(target) * insertCost
        Exit Function
    End If

    If Len(target) = 0 Then
        WeightedDL = Len(source) * deleteCost
        Exit Function
    End If

    Dim table() As Double
    ReDim table(Len(source), Len(target))

    Dim sourceIndexByCharacter() As Variant
    ReDim sourceIndexByCharacter(0 To 1, 0 To Len(source) - 1) As Variant

    If Left(source, 1) <> Left(target, 1) Then
        table(0, 0) = Application.Min(replaceCost, (deleteCost + insertCost))
    End If

    sourceIndexByCharacter(0, 0) = Left(source, 1)
    sourceIndexByCharacter(1, 0) = 0

    Dim deleteDistance As Double
    Dim insertDistance As Double
    Dim matchDistance As Double

    For i = 1 To Len(source) - 1

        deleteDistance = table(i - 1, 0) + deleteCost
        insertDistance = ((i + 1) * deleteCost) + insertCost

        If Mid(source, i + 1, 1) = Left(target, 1) Then
            matchDistance = (i * deleteCost) + 0
        Else
            matchDistance = (i * deleteCost) + replaceCost
        End If

        table(i, 0) = Application.Min(Application.Min(deleteDistance, insertDistance), matchDistance)
    Next

    For j = 1 To Len(target) - 1

        deleteDistance = table(0, j - 1) + insertCost
        insertDistance = ((j + 1) * insertCost) + deleteCost

        If Left(source, 1) = Mid(target, j + 1, 1) Then
            matchDistance = (j * insertCost) + 0
        Else
            matchDistance = (j * insertCost) + replaceCost
        End If

        table(0, j) = Application.Min(Application.Min(deleteDistance, insertDistance), matchDistance)
    Next

    For i = 1 To Len(source) - 1

        Dim maxSourceLetterMatchIndex As Integer

        If Mid(source, i + 1, 1) = Left(target, 1) Then
            maxSourceLetterMatchIndex = 0
        Else
            maxSourceLetterMatchIndex = -1
        End If

        For j = 1 To Len(target) - 1

            Dim candidateSwapIndex As Integer
            candidateSwapIndex = -1

            For k = 0 To UBound(sourceIndexByCharacter, 2)
                If sourceIndexByCharacter(0, k) = Mid(target, j + 1, 1) Then candidateSwapIndex = sourceIndexByCharacter(1, k)
            Next

            Dim jSwap As Integer
            jSwap = maxSourceLetterMatchIndex

            deleteDistance = table(i - 1, j) + deleteCost
            insertDistance = table(i, j - 1) + insertCost
            matchDistance = table(i - 1, j - 1)

            If Mid(source, i + 1, 1) <> Mid(target, j + 1, 1) Then
                matchDistance = matchDistance + replaceCost
            Else
                maxSourceLetterMatchIndex = j
            End If

            Dim swapDistance As Double

            If candidateSwapIndex <> -1 And jSwap <> -1 Then

                Dim iSwap As Integer
                iSwap = candidateSwapIndex

                Dim preSwapCost
                If iSwap = 0 And jSwap = 0 Then
                    preSwapCost = 0
                Else
                    preSwapCost = table(Application.Max(0, iSwap - 1), Application.Max(0, jSwap - 1))
                End If

                swapDistance = preSwapCost + ((i - iSwap - 1) * deleteCost) + ((j - jSwap - 1) * insertCost) + swapCost

            Else
                swapDistance = 500
            End If

            table(i, j) = Application.Min(Application.Min(Application.Min(deleteDistance, insertDistance), matchDistance), swapDistance)

        Next

        sourceIndexByCharacter(0, i) = Mid(source, i + 1, 1)
        sourceIndexByCharacter(1, i) = i

    Next

    WeightedDL = table(Len(source) - 1, Len(target) - 1)

End Function
like image 788
rgmatthes Avatar asked Dec 03 '12 22:12

rgmatthes


1 Answers

I can see you've answered this yourself: I wrote a modified Levenshtein edit distance algorithm for address matching a couple of years ago (the site's now hosted in Russia and it's a bad idea to go there) but that didn't perform at all well, and a 'sum of common strings' approach was adequate for the task in hand:

Fuzzy-Matching strings in Excel using a simplified 'Edit Distance' proxy in VBA

That code probably needs re-testing and re-work.

Looking at your code, if you ever want to revisit it, here's a speed tip:

Dim arrByte() As Byte 
Dim byteChar As Byte 

arrByte = strSource

for i = LBound(arrByte) To UBound(arrByte) Step 2     byteChar = arrByte(i)     ' I'll do some comparison operations using integer arithmetic on the char Next i

String-handling in VBA is horribly slow, even if you use Mid$() instead of Mid(), but numeric operations are pretty good: and strings are actually arrays of bytes, which the compiler will accept at face value.

The 'step' of 2 in the loop is to skip over the high-order bytes in unicode strings - you're probably running your string comparison on plain-vanilla ASCII text, and you'll see that the byte array for (say) "ABCd" is (00, 65, 00, 66, 00, 67, 00, 100). Most of the Latin alphabet in Western European countries - accents, diacritics, dipthongs and all - will fit in under 255 and won't venture into the higer-order bytes that show as zeroes in that wxample.

You'll get away with it in strictly monolingual string comparisons in Hebrew, Greek, Russian and Arabic because the upper byte is constant within each alphabet: Greek "αβγδ" is the byte array (177,3,178,3,179,3,180,3). However, that's sloppy coding and it'll bite (or byte) you the moment you try string comparisons across languages. And it's never going to fly in Eastern alphabets.

like image 167
Nigel Heffernan Avatar answered Nov 06 '22 04:11

Nigel Heffernan