Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there any way I can speed up this VBA algorithm?

Tags:

excel

vba

trie

I am looking to implement a VBA trie-building algorithm that is able to process a substantial English lexicon (~50,000 words) in a relatively short amount of time (less than 15-20 seconds). Since I am a C++ programmer by practice (and this is my first time doing any substantial VBA work), I built a quick proof-of-concept program that was able to complete the task on my computer in about half a second. When it came time to test the VBA port however, it took almost two minutes to do the same -- an unacceptably long amount of time for my purposes. The VBA code is below:

Node Class Module:

Public letter As String
Public next_nodes As New Collection
Public is_word As Boolean

Main Module:

Dim tree As Node

Sub build_trie()
    Set tree = New Node
    Dim file, a, b, c As Integer
    Dim current As Node
    Dim wordlist As Collection
    Set wordlist = New Collection
    file = FreeFile
    Open "C:\corncob_caps.txt" For Input As file
    Do While Not EOF(file)
        Dim line As String
        Line Input #file, line
        wordlist.add line
    Loop
    For a = 1 To wordlist.Count
        Set current = tree
        For b = 1 To Len(wordlist.Item(a))
            Dim match As Boolean
            match = False
            Dim char As String
            char = Mid(wordlist.Item(a), b, 1)
            For c = 1 To current.next_nodes.Count
                If char = current.next_nodes.Item(c).letter Then
                    Set current = current.next_nodes.Item(c)
                    match = True
                    Exit For
                End If
            Next c
            If Not match Then
                Dim new_node As Node
                Set new_node = New Node
                new_node.letter = char
                current.next_nodes.add new_node
                Set current = new_node
            End If
        Next b
        current.is_word = True
    Next a
End Sub

My question then is simply, can this algorithm be sped up? I saw from some sources that VBA Collections are not as efficient as Dictionarys and so I attempted a Dictionary-based implementation instead but it took an equal amount of time to complete with even worse memory usage (500+ MB of RAM used by Excel on my computer). As I say I am extremely new to VBA so my knowledge of both its syntax as well as its overall features/limitations is very limited -- which is why I don't believe that this algorithm is as efficient as it could possibly be; any tips/suggestions would be greatly appreciated.

Thanks in advance

NB: The lexicon file referred to by the code, "corncob_caps.txt", is available here (download the "all CAPS" file)

like image 238
GRB Avatar asked Oct 07 '11 21:10

GRB


1 Answers

There are a number of small issues and a few larger opportunities here. You did say this is your first vba work, so forgive me if I'm telling you things you already know

Small things first:
Dim file, a, b, c As Integer declares file, a and b as variants. Integer is 16 bit sign, so there may be risk of overflows, use Long instead.

DIM'ing inside loops is counter-productive: unlike C++ they are not loop scoped.

The real opportunity is:

Use For Each where you can to iterate collections: its faster than indexing.

On my hardware your original code ran in about 160s. This code in about 2.5s (both plus time to load word file into the collection, about 4s)

Sub build_trie()
    Dim t1 As Long
    Dim wd As Variant
    Dim nd As Node

    Set tree = New Node
    ' Dim file, a, b, c As Integer  : declares file, a, b as variant
    Dim file As Integer, a As Long, b As Long, c As Long     ' Integer is 16 bit signed

    Dim current As Node
    Dim wordlist As Collection
    Set wordlist = New Collection
    file = FreeFile
    Open "C:\corncob_caps.txt" For Input As file

    ' no point in doing inside loop, they are not scoped to the loop
    Dim line As String
    Dim match As Boolean
    Dim char As String
    Dim new_node As Node

    Do While Not EOF(file)
        'Dim line As String
        Line Input #file, line
        wordlist.Add line
    Loop


    t1 = GetTickCount
    For Each wd In wordlist ' for each is faster
    'For a = 1 To wordlist.Count
        Set current = tree
        For b = 1 To Len(wd)
            'Dim match As Boolean
            match = False
            'Dim char As String
            char = Mid$(wd, b, 1)
            For Each nd In current.next_nodes
            'For c = 1 To current.next_nodes.Count
                If char = nd.letter Then
                'If char = current.next_nodes.Item(c).letter Then
                    Set current = nd
                    'Set current = current.next_nodes.Item(c)
                    match = True
                    Exit For
                End If
            Next nd
            If Not match Then
                'Dim new_node As Node
                Set new_node = New Node
                new_node.letter = char
                current.next_nodes.Add new_node
                Set current = new_node
            End If
        Next b
        current.is_word = True
    Next wd

    Debug.Print "Time = " & GetTickCount - t1 & " ms"
End Sub

EDIT:

loading the word list into a dynamic array will reduce load time to sub second. Be aware that Redim Preserve is expensive, so do it in chunks

    Dim i As Long, sz As Long
    sz = 10000
    Dim wordlist() As String
    ReDim wordlist(0 To sz)

    file = FreeFile
    Open "C:\corncob_caps.txt" For Input As file

    i = 0
    Do While Not EOF(file)
        'Dim line As String
        Line Input #file, line
        wordlist(i) = line
        i = i + 1
        If i > sz Then
            sz = sz + 10000
            ReDim Preserve wordlist(0 To sz)
        End If
        'wordlist.Add line
    Loop
    ReDim Preserve wordlist(0 To i - 1)

then loop through it like

    For i = 0 To UBound(wordlist)
        wd = wordlist(i)
like image 92
chris neilsen Avatar answered Sep 28 '22 00:09

chris neilsen