Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex Microsoft Word without destroying document formatting

Tags:

regex

ms-word

vba

It's well known that word's find and replace "wildcards" features suffer some severe limitations.

The following code implements true regex find and replace in a word document, and variations on it are found in other Stackoverflow and SuperUser questions.

Sub RegEx_PlainText(Before As String, After As String)

    Dim regexp As Object
    Set regexp = CreateObject("vbscript.regexp")            

    With regexp
        .Pattern = Before
        .IgnoreCase = True
        .Global = True

         'could be any Range , .Range.Text , or selection object
         ActiveDocument.Range = .Replace(ActiveDocument.Range, After)

    End With
End Sub

However, this wipes the document of all formatting.

Word will not preserve formatting character by character even if the strings are of the same length or indeed the same string, so ActiveDocument.Range = ActiveDocument.Range or Selection.Text=Selection.Text will wipe all formatting (or more accurately, format the whole range the same as the first character in the range, and add a carriage return). Upon reflection, this behavior isn't so surprising.

To solve this, the following code runs a regex find, then loops through the matches and runs .replace only on the range where the match is found. This then, would only lose formatting if the match iself had a variety of formatting (for example an italicised word would be lost)

Hopefully the code comments make this quite transparent.

Sub RegEx(Before As String, After As String, _
          Optional CaseSensitive As Boolean = False, _
          Optional Location As Range = Nothing, _
          Optional DebugMode As Boolean = False)

    'can't declare activedocument.range in parameters
    If Location Is Nothing Then Set Location = ActiveDocument.Range

    Dim regexp As Object
    Dim Foundmatches As Object
    Dim Match As Object
    Dim MatchRange As Range
    Dim offset As Integer: offset = 0
    Set regexp = CreateObject("vbscript.regexp")

   With regexp
        .Pattern = Before
        .IgnoreCase = Not CaseSensitive
        .Global = True

        'set foundmatches to collection of all regex matches
        Set Foundmatches = .Execute(Location.text)

        For Each Match In Foundmatches

            'set matchrange to location of found string in source doc.
            'offset accounts for change in length of  document from already completed replacements
            Set MatchRange = Location.Document _
                   .Range(Match.FirstIndex + offset, _
                          Match.FirstIndex + Match.Length + offset)

            'debugging
            If DebugMode Then
                    Debug.Print "strfound      = " & Match.Value
                    Debug.Print "matchpoint    = " & Match.FirstIndex
                    Debug.Print "origstrlength = " & Match.Length
                    Debug.Print "offset        = " & offset
                    Debug.Print "matchrange    = " & MatchRange.text
                    MatchRange.Select
                Stop

            Else
            'REAL LIFE
                'run the regex replace just on the range containing the regex match
                MatchRange = .Replace(MatchRange, After)

                'increment offset to account for change in length of document
                offset = offset + MatchRange.End - MatchRange.Start - Match.Length
            End If
        Next
    End With
End Sub

This works on simple documents, but when I run it on a real document, matchrange ends up being at some point near the where the match was found, but not exactly right. It's not predictably off, sometimes it is to the right, and sometimes to the left. Generally the more complex the document. (links, tables of context, formatting etc.) the more wrong it ends up being.

Does anyone know why this doesn't work, and how to fix it? If I could understand why this isn't working, then I might be able to determine whether this approach can be fixed, or if I just need to try a different method.

Code includes DebugMode param which means it will just loop through the doc and highlight all matches, performing no changes. Also outputs a bunch of stuff to the console. This should be helpful for anyone kind enough to tackle this problem with me.

https://calibre-ebook.com/downloads/demos/demo.docx Here is a sample document (not produced by me) which may be useful.

like image 897
Some_Guy Avatar asked Jan 18 '17 14:01

Some_Guy


1 Answers

@Some_Guy: thanks for asking this question, I had a similar problem and your post saved me quite a bit of time.

This is the kludge I came up with:

Sub RegEx(Before As String, After As String, _
          Optional CaseSensitive As Boolean = False, _
          Optional Location As Range = Nothing, _
          Optional DebugMode As Boolean = False)

    'can't declare activedocument.range in parameters
    If Location Is Nothing Then Set Location = ActiveDocument.Range

    Dim j As Long
    Dim regexp As Object
    Dim Foundmatches As Object
    Dim Match As Object
    Dim MatchRange As Range
    Dim offset As Integer: offset = 0
    Set regexp = CreateObject("vbscript.regexp")

    With regexp
        .Pattern = Before
        .IgnoreCase = Not CaseSensitive
        .Global = True

        'set foundmatches to collection of all regex matches
        Set Foundmatches = .Execute(Location.Text)
        For j = Foundmatches.Count - 1 To 0 Step -1

            If DebugMode = True Then
                'debugging
                Debug.Print Foundmatches(j), .Replace(Foundmatches(j), After)
            Else
                'REAL LIFE

                'run a plain old find/replace on the found string and eplace strings
                With ActiveDocument.Range.Find
                    .ClearFormatting
                    .Replacement.ClearFormatting
                    .Replacement.Font.Hidden = True
                    .Text = Foundmatches(j)
                    .Replacement.Text = regexp.Replace(Foundmatches(j), After)
                    .Execute Replace:=wdReplaceAll
                End With
            End If
        Next j
    End With
End Sub

Basically I use a simple find/replace with strings that match each item found (and would be replaced) with a regex, would decent support for it exist in Word). Note that any text replaced takes on the formatting of the first replaced character, so if the first word is in bold, then all the replaced text will be bold.

like image 105
cybernetic.nomad Avatar answered Oct 13 '22 20:10

cybernetic.nomad