Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Move emails where the subject matches a particular RegEx

Tags:

regex

vba

outlook

I would like to cycle through the emails in a folder ("Inbox" of "[email protected]") and move emails where the subject matches a particular RegEx to a different folder.

Sub RegExpMoveEmailToFolderSO()
    Dim MyFolder As Outlook.Folder
    Dim MyNS As NameSpace
    Dim MyEmail As Outlook.MailItem
    Dim MyItems As Outlook.Items
    Dim CountMatches As Integer
    Dim MySubject As String
    Dim MyRegExp As RegExp
    Dim MyDestinationFolder As Outlook.Folder
           
    Set MyNS = Application.GetNamespace("MAPI")
    Set MyFolder = MyNS.Folders("[email protected]").Folders("Inbox")
    Set MyDestinationFolder = MyNS.Folders("[email protected]").Folders("Inbox")
    Set MyItems = MyFolder.Items
    Set MyRegExp = New RegExp
        
    CountMatches = 1
    MyRegExp.Pattern = "(Reg).*(Exp)"
        
    For Each Item In MyItems
        MySubject = Item.Subject
        If MyRegExp.Test(MySubject) Then
            Item.Move MyDestinationFolder
            CountMatches = CountMatches + 1
        End If
    Next
            
    MsgBox "The total number of emails moved is: " & CountMatches & "."
End Sub

This is quite slow compared to say a similar Rule in Outlook and spins up the fans on my i7 machine.

Is anything obviously inefficient about this code?

like image 613
tchakravarty Avatar asked Feb 03 '19 15:02

tchakravarty


2 Answers

I am not a Regex expert so I use a test harness to help me develop patterns. I tried matching your pattern and some variations against a number of strings that match your subjects. I had not thought of timing different patterns before but I have now added that as an option to my test harness. The results below were not as I expected.

Pattern        Text                   Duration

(Reg).*(Exp)   xxxRegyyyExpzzz        0.00000216
(Reg).*(Exp)   xxxxRegExpzzz          0.00000212
(Reg).*(Exp)   xxxxxRegyEyyExpzzz     0.00000220
(Reg).*(Exp)   xxxxxxRegyyExyExpzzz   0.00000220

Reg.*Exp       xxxRegyyyExpzzz        0.00000199
Reg.*Exp       xxxxRegExpzzz          0.00000198
Reg.*Exp       xxxxxRegyEyyExpzzz     0.00000204
Reg.*Exp       xxxxxxRegyyExyExpzzz   0.00000205

Reg.*?Exp      xxxRegyyyExpzzz        0.00000205
Reg.*?Exp      xxxxRegExpzzz          0.00000188
Reg.*?Exp      xxxxxRegyEyyExpzzz     0.00000214
Reg.*?Exp      xxxxxxRegyyExyExpzzz   0.00000220

Timing VBA routines is difficult because background interpreter and OS routines can significantly affect timings. I have to increase the number of repeats to 10,000,000 before the total duration was enough for me to consider the average duration reliable.

As you can see removing the capture brackets saves a little time although you would need thousands of emails before you would notice. Only the number of characters between "Reg" and "Exp" seem to have much effect.

I do not understand why the first two patterns work. .* is said to be greedy. It should match every character up to the end of the string or the next linefeed. The pattern should not find the "Exp" because they matched the .*. Only the lazy .*? should have stopped matching characters when it found "Exp". Either I have misunderstood greedy versus lazy matching or the VBA Regex engine does not treat .* as greedy.

My conclusion is that regex matching is not the cause of your routine being slow. I suggest you try Tim's suggestion. IAmANerd2000 added a routine demonstrating Tim's suggestion but he/she has since deleted it. (I can see deleted answers because my reputation is over 10K.) Perhaps Tim would like to add an answer demonstrating his suggestion.

I include my test harness below in case you find it helpful. Its output per pattern and text is:

===========================================
   Pattern: "(Reg).*(Exp)"
      Text: "xxxRegyyyExpzzz"
Av Durat'n: 0.00000216
-------------------------------------------
     Match: 1
     Value: "RegyyyExp"
    Length: 9
FirstIndex: 3
  SubMatch: 1 "Reg"
  SubMatch: 2 "Exp"
===========================================

Option Explicit
Sub Test2()

  Dim Patterns As Variant
  Dim Texts As Variant

  Texts = Array("xxxRegyyyExpzzz", _
                "xxxxRegExpzzz", _
                "xxxxxRegyEyyExpzzz", _
                "xxxxxxRegyyExyExpzzz")

  Patterns = Array("(Reg).*(Exp)", _
                   "Reg.*Exp", _
                   "Reg.*?Exp")

  Call TestCapture(Patterns, Texts, True)

End Sub
Sub TestCapture(ByRef Patterns As Variant, ByRef Texts As Variant, _
                Optional ByVal TimeDuration As Boolean = False)

  ' Patterns      an array of patterns to be tested
  ' Texts         an array of text to be matched against the patterns
  ' TimeDuration  if True, record the average duration of the match

  ' Attempts to match each text against each pattern and reports on the result
  ' If TimeDuration is True, repeats the match 10,000,000 times and reports the
  ' average duration so the efficiency of different patterns can be determined

  Dim CountCrnt As Long
  Dim CountMax As Long
  Dim InxM As Long
  Dim InxS As Long
  Dim Matches As MatchCollection
  Dim PatternCrnt As Variant
  Dim RegEx As New RegExp
  Dim TimeEnd As Double
  Dim TimeStart As Double
  Dim SubMatchCrnt As Variant
  Dim TextCrnt As Variant

  With RegEx
    .Global = True         ' Find all matches
    .MultiLine = False     ' Match cannot extend across linebreak
    .IgnoreCase = True

    For Each PatternCrnt In Patterns
     .Pattern = PatternCrnt

      For Each TextCrnt In Texts
        Debug.Print "==========================================="
        Debug.Print "   Pattern: """ & PatternCrnt & """"
        Debug.Print "      Text: """ & TidyTextForDspl(TextCrnt) & """"
        If TimeDuration Then
          CountMax = 10000000
          TimeStart = Timer
        Else
          CountMax = 1
        End If
        For CountCrnt = 1 To CountMax
          If Not .test(TextCrnt) Then
            Debug.Print Space(12) & "Text does not match pattern"
            Exit For
          Else
            Set Matches = .Execute(TextCrnt)
            If CountCrnt = CountMax Then
              TimeEnd = Timer
              If TimeDuration Then
                Debug.Print "Av Durat'n: " & Format((TimeEnd - TimeStart) / CountMax, "0.00000000")
              End If
              If Matches.Count = 0 Then
                Debug.Print Space(12) & "Match but no captures"
              Else
                For InxM = 0 To Matches.Count - 1
                  Debug.Print "-------------------------------------------"
                  With Matches(InxM)
                    Debug.Print "     Match: " & InxM + 1
                    Debug.Print "     Value: """ & TidyTextForDspl(.Value) & """"
                    Debug.Print "    Length: " & .Length
                    Debug.Print "FirstIndex: " & .FirstIndex
                    For InxS = 0 To .SubMatches.Count - 1
                      Debug.Print "  SubMatch: " & InxS + 1 & " """ & _
                                  TidyTextForDspl(.SubMatches(InxS)) & """"
                    Next
                  End With
                Next InxM
              End If
            End If
          End If
        Next CountCrnt
      Next TextCrnt
    Next PatternCrnt
    Debug.Print "==========================================="

  End With

End Sub
Public Function TidyTextForDspl(ByVal Text As String) As String

  ' Tidy Text for dsplay by replacing white space with visible strings:
  '   Replace spaces by          ‹s› or ‹n s›
  '   Replace line feed by       ‹lf› or ‹n lf›
  '   Replace carriage return by ‹cr› or ‹n cr›
  '   Replace tab by             ‹tb› or ‹n tb›
  '   Replace non-break space by ‹nbs› or {n nbs›
  ' Where n is a count if the character repeats

  ' 15Mar16  Coded
  '  3Feb19  Replaced "{" (\x7B) and "}" (\x7D) by "‹" (\u2039) and "›" (\u203A)
  '          on the grounds that the angle quotation marks were not likely to
  '          appear in text to be displayed.

  Dim InsStr As String
  Dim InxWsChar As Long
  Dim NumWsChar As Long
  Dim PosWsChar As Long
  Dim RetnVal As String
  Dim WsCharCrnt As Variant
  Dim WsCharValue As Variant
  Dim WsCharDspl As Variant

  WsCharValue = Array(" ", vbLf, vbCr, vbTab, Chr(160))
  WsCharDspl = Array("s", "lf", "cr", "tb", "nbs")

  RetnVal = Text
  For InxWsChar = LBound(WsCharValue) To UBound(WsCharValue)
    Do While True
      PosWsChar = InStr(1, RetnVal, WsCharValue(InxWsChar))
      If PosWsChar = 0 Then
        Exit Do
      End If
      NumWsChar = 1
      Do While Mid(RetnVal, PosWsChar + NumWsChar, 1) = WsCharValue(InxWsChar)
        NumWsChar = NumWsChar + 1
      Loop
      If NumWsChar = 1 Then
        InsStr = "‹" & WsCharDspl(InxWsChar) & "›"
      Else
        InsStr = "‹" & NumWsChar & WsCharDspl(InxWsChar) & "›"
      End If
      RetnVal = Mid(RetnVal, 1, PosWsChar - 1) & InsStr & Mid(RetnVal, PosWsChar + NumWsChar)
    Loop
  Next

  TidyTextForDspl = RetnVal

End Function
like image 79
Tony Dallimore Avatar answered Nov 09 '22 14:11

Tony Dallimore


  1. Yes, the code is extremely inefficient - you should never loop through all items in a folder. Use Items.Find/FindNext or Items.Restrict to do the job. These methods do not support RegEx, but (if you really need to use RegEx) you should at least use these methods to filter down your potential matches.

See Microsoft docs for the query format and examples.

  1. Also note that you are using "for each" loop as you are modifying the very same collection (by calling Move) - that will cause you to skip some items. Always use a down loop from Items.Count down to 1 step -1 (where the Items is returned preferably returned by Items.Restrict - see #1).
like image 2
Dmitry Streblechenko Avatar answered Nov 09 '22 14:11

Dmitry Streblechenko