Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Quickly remove unnecessary whitespace from a (very large) string

I'm working with very large (45,000,000+ character) strings in VBA, and I need to remove superfluous whitespace.

One space (aka, ASCII Code 32) is okay but any sections with two or more consecutive spaces should be reduced to only one.

I found a similar question here, although that OP's definition of a "very long string" was only 39,000 characters. The accepted answer was a loop using Replace:

Function MyTrim(s As String) As String
    Do While InStr(s, "  ") > 0
        s = Replace$(s, "  ", " ")
    Loop
    MyTrim = Trim$(s)
End Function

I tried this method and it was "worked", but was painfully slow:

Len In:  44930886 
Len Out: 35322469
Runtime: 247.6 seconds

Is there a faster way to remove whitespace from a "very large" string?

like image 716
ashleedawg Avatar asked Feb 10 '18 13:02

ashleedawg


2 Answers

I suspect the performance problem is due to creating a very large number of large intermediate strings. So, any method that does things without creating intermediate strings or with much fewer would perform better.

A Regex replace has a good chance of that.

Option Explicit

Sub Test(ByVal text As String)

  Static Regex As Object
  If Regex Is Nothing Then
    Set Regex = CreateObject("VBScript.RegExp")
    Regex.Global = True
    Regex.MultiLine = True
  End If

  Regex.Pattern = " +" ' space, one or more times

  Dim result As String: result = Regex.Replace(text, " ")
  Debug.Print Len(result), Left(result, 20)
End Sub

With an input string of 45 million characters takes about a second.

Runner:

Sub Main()

  Const ForReading As Integer = 1
  Const FormatUTF16 As Integer = -1 ' aka TriStateTrue
  Dim fso As Object: Set fso = CreateObject("Scripting.FileSystemObject")
  Dim file As Object: Set file = fso.OpenTextFile("C:\ProgramData\test.txt", ForReading, False, FormatUTF16)
  Dim text As String: text = file.ReadAll()
  Set file = Nothing
  Set fso = Nothing
  Debug.Print Len(text), Left(text, 20)

  Test (text)

End Sub

Test data creator (C#):

var substring = "××\n× ××   ";
var text = String.Join("", Enumerable.Repeat(substring, 45_000_000 / substring.Length));
var encoding = new UnicodeEncoding(false, false);
File.WriteAllText(@"C:\ProgramData\test.txt", text, encoding);

BTW—Since VBA (VB4, Java, JavaScript, C#, VB, …) uses UTF-16, the space character is the one UTF-16 code unit ChrW(32). (Any similarity to or comparison with ASCII, is unnecessary mental gymnastics, and if put into code as ANSI [Chr(32)], unnecessary conversion behind the scenes, with different behavior for different machines, users and times.)

like image 102
Tom Blodget Avatar answered Nov 15 '22 08:11

Tom Blodget


In VBA, the size of a String is limited to approximately 2 Billion Characters. The "Replace-Loop" method above took 247 seconds for a 45 Million character string, which is over 4 minutes.

Theoretically, that means a 2 Billion character string would take at least 3 hours — if it even finished without crashing — so it's not exactly practical.

Excel has a built-in worksheet function Trim which is not the same as VBA's Trim function.

Worksheet function Trim removes all spaces from text except for single spaces between words.

The problem is that Trim, like all functions called with Application.WorksheetFunction, has a size limit of 32,767 characters, and this [unfortunately] applies even when calling the function from VBA with a string that's not even in a cell.

However, we can still use the function if we use it to loop through our "gigantic string" in sections, like this:

EDIT: Don't even bother with this crap (my function, below)! See the RegEx answer above.

Function bigTrim(strIn As String) As String

    Const maxLen = 32766
    Dim loops As Long, x As Long
    loops = Int(Len(strIn) / maxLen)
    If (Len(strIn) / maxLen) <> loops Then loops = loops + 1

    For x = 1 To loops
        bigTrim = bigTrim & _
            Application.WorksheetFunction.Trim(Mid(strIn, _
            ((x - 1) * maxLen) + 1, maxLen))
    Next x

End Function

Running this function on the same string that was used with the "Replace-Loop" method yielded much better results:

Len In:  44930886 
Len Out: 35321845
Runtime: 33.6 seconds

That's more than 7x faster than the "Replace-Loop" method, and managed to remove 624 spaces that were somehow missed by the other method.

(I though about looking into why the first method missed characters, but since I know my string isn't missing anything, and the point of this exercise was to save time, that would be silly!)

like image 22
ashleedawg Avatar answered Nov 15 '22 06:11

ashleedawg