Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the fastest way to turn every member of an array alphanumeric?

Final final results:

I was wondering whether the results below changed if the string was longer. I ran exactly the same tests on the same computer, except each cell had a random string of 34 characters rather than four. These were the results:

Comintern (Regexp):       136.1  ms  
brettdj (Regexp):         139.9  ms  
Slai (Regexp):            158.4  ms  
*Original Regex:          161.0  ms*    
Comintern (AN):           170.1  ms  
Comintern (Hash):         183.6  ms  
ThunderFrame:             232.9  ms    
*Original replace:        372.9  ms*  
*Original InStr:          478.1  ms*  
CallumDA33:              1218.1 ms

This really shows the speed of the Regex - all the solutions utilising Regex.replace are significantly faster, with the best being Comintern's implementation.

In summary, if the strings are long, use arrays, if they are short, use the clipboard. If unsure, the optimal result is to use arrays, but this may sacrifice a little performance on short strings.

Final results:

Thanks very much for all of your suggestions, clearly I still have a lot to learn. I was thinking about this all yesterday, so I decided to rerun everything at home. Here are the final results, based on applying each of these to 30,000 four character strings.

My computer at home is an Intel i7 @ 3.6 GHz, 8GB RAM, 64-bit Windows 10 and Excel 2016. Similar conditions to before in that I have processes running in the background, but I'm not actively doing anything throughout the tests.

Original replace:  97.67  ms
Original InStr:    106.54 ms
Original Regex:    113.46 ms
ThunderFrame:      82.21  ms
Comintern (AN):    96.98  ms
Comintern (OR):    81.87  ms
Comintern (Hash):  101.18 ms
brettdj:           81.66  ms
CallumDA33:        201.64 ms
Slai:              68.38  ms

I've therefore accepted Slai's answer as it is clearly the fastest for general implementation, but I'll rerun them all at work against the actual data to check this still works.


Original post:

I have an array in Excel that is a list of part numbers. I need to turn every member of the array alphanumeric, for example

ABC123-001 -> ABC123001
ABC123/001 -> ABC123001
ABC123001  -> ABC123001

What is the fastest way of doing this?

For context, our part numbers can come in different forms, so I'm writing a function that finds the best match within a given range. At the moment, the part of the function that makes everything alphanumeric takes about 50ms to run, whereas the rest of the function takes around 30ms in total. I also can't avoid using Excel.

I've done some work myself (see answer below), but the main issue is that I have to loop through every element of the array one-by-one - could there be a better way? I've also never run tests before, so any feedback on improving them would be much appreciated.

Here is what I've tried so far.

I'm using MicroTimer and my computer has an Intel i5 @2.5GHz, 4GB of RAM, 64-bit Windows 7. I've got processes running in the background, but I'm not actively doing anything else whilst these are run.

I created 30,000 lines of random symbols using this code:

=CHAR(RANDBETWEEN(1,60))&CHAR(RANDBETWEEN(48,57))&CHAR(RANDBETWEEN(37,140))&CHAR(RANDBETWEEN(37,140))

(note how we stop the first character at 60 because '=' is char(61) and we want to avoid Excel interpreting this as a formula. Also we force the second character to be a number so we can guarantee at least one alphanumeric character in there.)

1. Using a loop based on cases. Average time: 175ms

Using the function in this post, we load the range into an array, apply the function to each element of the array and paste it back. Code:

Function AlphaNumericOnly(strSource As Variant) As String
    Dim i As Integer
    Dim strResult As String

    For i = 1 To Len(strSource)
        Select Case Asc(Mid(strSource, i, 1))
            Case 48 To 57, 65 To 90, 97 To 122: 'include 32 if you want to include space
                strResult = strResult & Mid(strSource, i, 1)
        End Select
    Next
    AlphaNumericOnly = strResult
End Function

Sub Replace()

    Dim inputSh As Worksheet
    Dim inputRng As Range
        Set inputSh = Sheets("Data")
        Set inputRng = inputSh.Range("A1:A30000")

    Dim outputSh As Worksheet
    Dim outputRng As Range
        Set outputSh = Sheets("Replace")
        Set outputRng = outputSh.Range("A1:A30000")

    Dim time1 As Double, time2 As Double
        time1 = MicroTimer

    Dim arr As Variant
        arr = inputRng

    Dim i As Integer
        For i = LBound(arr) To UBound(arr)
            arr(i, 1) = AlphaNumericOnly(arr(i, 1))
        Next i

    outputRng = arr

    time2 = MicroTimer

    Debug.Print (time2 - time1) * 1000

End Sub

2. Using InStr() to check each character. Average time: 201ms

Define a string of valid values. Check one-by-one if the valid values appear in the array elements:

Sub InStr()

    Dim inputSh As Worksheet
    Dim inputRng As Range
        Set inputSh = Sheets("Data")
        Set inputRng = inputSh.Range("A1:A30000")

    Dim outputSh As Worksheet
    Dim outputRng As Range
        Set outputSh = Sheets("InStr")
        Set outputRng = outputSh.Range("A1:A30000")

    Dim time1 As Double, time2 As Double
        time1 = MicroTimer

    Dim arr As Variant
        arr = inputRng

    Dim validValues As String
        validValues = "01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" 'put numbers and capitals at the start as they are more likely'

    Dim i As Integer, j As Integer
    Dim result As String

        For i = LBound(arr) To UBound(arr)
        result = vbNullString
            For j = 1 To Len(arr(i, 1))
                If InStr(validValues, Mid(arr(i, 1), j, 1)) <> 0 Then
                    result = result & Mid(arr(i, 1), j, 1)
                End If
            Next j
        arr(i, 1) = result
        Next i

    outputRng = arr

    time2 = MicroTimer

    Debug.Print (time2 - time1) * 1000

End Sub

3. Using regex.Replace on the array. Time: 171ms

Define a regex and use this to replace each element of the array.

Sub Regex()

    Dim inputSh As Worksheet
    Dim inputRng As Range
        Set inputSh = Sheets("Data")
        Set inputRng = inputSh.Range("A1:A30000")

    Dim outputSh As Worksheet
    Dim outputRng As Range
        Set outputSh = Sheets("Regex")
        Set outputRng = outputSh.Range("A1:A30000")

    Dim time1 As Double, time2 As Double
        time1 = MicroTimer

    Dim arr As Variant
        arr = inputRng

    Dim objRegex As Object
        Set objRegex = CreateObject("vbscript.regexp")
        With objRegex
            .Global = True
            .ignorecase = True
            .Pattern = "[^\w]"
        End With

    Dim i As Integer
        For i = LBound(arr) To UBound(arr)
            arr(i, 1) = objRegex.Replace(arr(i, 1), vbNullString)
        Next i

    outputRng = arr

    time2 = MicroTimer

    Debug.Print (time2 - time1) * 1000

End Sub

Edit:

@ThunderFrame - our part numbers generally come in the following formats:

  • All numbers (e.g. 32523452)
  • Mix of letters and numbers (e.g. AB324K234 or 123H45645)
  • Mix of letters and numbers, each linked by a non-alphanumeric character (e.g. ABC001-001, ABC001/001, 123/4557-121)

I have thought about using regex.test on each string before launching into the replacement, but I'm not sure if this will just copy the string to then test it, in which case I may as well just make the replacement to start with.

@Slai - thanks for the link - I will look into that in more detail

like image 330
Jonathan Avatar asked Dec 02 '16 10:12

Jonathan


1 Answers

It is not true that Regex has to be the winner. My second solution below is faster than even early-bound Regex! And my first solution is as fast as late-bound Regex. BOTH ARE NATIVE VBA ONLY.

Interesting question. The Original InStr method should be much faster than the results shown in the OP's question.

Its poor performance is due to string concatenation, which VBA is not good at. The longer the strings the worse it gets.

My version of the InStr method below does not use concatenation at all. It is many times faster than the original. In fact, its speed of execution matches late-bound Regex. This InStr version is completely native to VBA and very, very fast. And the longer the source strings, the faster it gets, relative to concatenation.

This method also gains a few ticks of performance by utilizing the ($) version of string functions instead of the variant version. InStrB is slightly faster than InStr. And using temporary string variables t and arx saves a good chunk of time as well.

Sub InStr_ExcelHero()

    Dim inputSh As Worksheet
    Dim inputRng As Range
        Set inputSh = Sheets("Data")
        Set inputRng = inputSh.Range("A1:A30000")

    Dim outputSh As Worksheet
    Dim outputRng As Range
        Set outputSh = Sheets("InStr")
        Set outputRng = outputSh.Range("A1:A30000")

    Dim time1 As Double, time2 As Double
        time1 = MicroTimer

    Dim i&, j&, p&, max&, arx$, t$, res$, arr
        arr = inputRng
        max = Len(arr(1, 1))

    Dim validVals$: validVals = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

        For i = LBound(arr) To UBound(arr)
            p = 0
            arx = arr(i, 1)
            res = Space$(max)
            For j = 1 To max
                t = Mid$(arx, j, 1)
                If InStrB(validVals, t) Then
                    p = p + 1
                    Mid$(res, p, 1) = t
                End If
            Next
            arr(i, 1) = Left$(res, p)
        Next

    outputRng = arr

    time2 = MicroTimer
    Debug.Print (time2 - time1) * 1000

End Sub

And the ArrayLookup version below is more than twice as fast as InStr_ExcelHero().

In fact, the method below is actually faster than early-bound Regex!

This is native VBA. No dependencies. Faster than Regex. The following method is likely the quickest way to turn every element of an array to alphanumeric... when directed from VBA... other than a custom c++ dll:

Sub ArrayLookup_ExcelHero()

    Const VALS$ = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

    Dim inputSh As Worksheet
    Dim inputRng As Range
        Set inputSh = Sheets("Data")
        Set inputRng = inputSh.Range("A1:A30000")

    Dim outputSh As Worksheet
    Dim outputRng As Range
        Set outputSh = Sheets("InStr")
        Set outputRng = outputSh.Range("A1:A30000")

    Dim time1 As Double, time2 As Double
        time1 = MicroTimer

    Dim i&, j&, p&, max&, t&, arx() As Byte, res() As Byte, arr
        arr = inputRng
        max = Len(arr(1, 1))

    Dim Keep&(0 To 255)
        For i = 1 To Len(VALS)
            Keep(Asc(Mid$(VALS, i, 1))) = 1
        Next

        For i = LBound(arr) To UBound(arr)
            p = 0
            ReDim res(0 To max)
            arx = StrConv(arr(i, 1), vbFromUnicode)
            For j = 0 To max - 1
                t = arx(j)
                If Keep(t) Then
                    res(p) = t
                    p = p + 1
                End If
            Next
            arr(i, 1) = StrConv(res, vbUnicode)
        Next

    outputRng = arr

    time2 = MicroTimer
    Debug.Print (time2 - time1) * 1000

End Sub
like image 191
Excel Hero Avatar answered Sep 22 '22 08:09

Excel Hero