How can I use regular expressions in Excel and take advantage of Excel's powerful grid-like setup for data manipulation?
I understand Regex is not ideal for many situations (To use or not to use regular expressions?) since excel can use Left
, Mid
, Right
, Instr
type commands for similar manipulations.
Regex allows you to do advanced sorting and filtering. The SeoTools plugin for Excel supports regex, but it — like most cool resources for Excel — is PC-swim only.
To start using RegEx in VBA, you need to select and activate the RegEx object reference library from the references list. Start by going to the VBA IDE, select Tools –> References, then select Microsoft VBScript Regular Expressions 5.5, then click OK.
Regular expressions are used for Pattern Matching.
To use in Excel follow these steps:
Step 1: Add VBA reference to "Microsoft VBScript Regular Expressions 5.5"
Step 2: Define your pattern
Basic definitions:
-
Range.
a-z
matches an lower case letters from a to z0-5
matches any number from 0 to 5[]
Match exactly one of the objects inside these brackets.
[a]
matches the letter a[abc]
matches a single letter which can be a, b or c[a-z]
matches any single lower case letter of the alphabet.()
Groups different matches for return purposes. See examples below.
{}
Multiplier for repeated copies of pattern defined before it.
[a]{2}
matches two consecutive lower case letter a: aa
[a]{1,3}
matches at least one and up to three lower case letter a
, aa
, aaa
+
Match at least one, or more, of the pattern defined before it.
a+
will match consecutive a's a
, aa
, aaa
, and so on?
Match zero or one of the pattern defined before it.
[a-z]?
matches empty string or any single lower case letter.*
Match zero or more of the pattern defined before it.
[a-z]*
matches empty string or string of lower case letters..
Matches any character except newline \n
a.
Matches a two character string starting with a and ending with anything except \n
|
OR operator
a|b
means either a
or b
can be matched.red|white|orange
matches exactly one of the colors.^
NOT operator
[^0-9]
character can not contain a number[^aA]
character can not be lower case a
or upper case A
\
Escapes special character that follows (overrides above behavior)
\.
, \\
, \(
, \?
, \$
, \^
Anchoring Patterns:
^
Match must occur at start of string
^a
First character must be lower case letter a
^[0-9]
First character must be a number.$
Match must occur at end of string
a$
Last character must be lower case letter a
Precedence table:
Order Name Representation 1 Parentheses ( ) 2 Multipliers ? + * {m,n} {m, n}? 3 Sequence & Anchors abc ^ $ 4 Alternation |
Predefined Character Abbreviations:
abr same as meaning \d [0-9] Any single digit \D [^0-9] Any single character that's not a digit \w [a-zA-Z0-9_] Any word character \W [^a-zA-Z0-9_] Any non-word character \s [ \r\t\n\f] Any space character \S [^ \r\t\n\f] Any non-space character \n [\n] New line
Example 1: Run as macro
The following example macro looks at the value in cell A1
to see if the first 1 or 2 characters are digits. If so, they are removed and the rest of the string is displayed. If not, then a box appears telling you that no match is found. Cell A1
values of 12abc
will return abc
, value of 1abc
will return abc
, value of abc123
will return "Not Matched" because the digits were not at the start of the string.
Private Sub simpleRegex() Dim strPattern As String: strPattern = "^[0-9]{1,2}" Dim strReplace As String: strReplace = "" Dim regEx As New RegExp Dim strInput As String Dim Myrange As Range Set Myrange = ActiveSheet.Range("A1") If strPattern <> "" Then strInput = Myrange.Value With regEx .Global = True .MultiLine = True .IgnoreCase = False .Pattern = strPattern End With If regEx.Test(strInput) Then MsgBox (regEx.Replace(strInput, strReplace)) Else MsgBox ("Not matched") End If End If End Sub
Example 2: Run as an in-cell function
This example is the same as example 1 but is setup to run as an in-cell function. To use, change the code to this:
Function simpleCellRegex(Myrange As Range) As String Dim regEx As New RegExp Dim strPattern As String Dim strInput As String Dim strReplace As String Dim strOutput As String strPattern = "^[0-9]{1,3}" If strPattern <> "" Then strInput = Myrange.Value strReplace = "" With regEx .Global = True .MultiLine = True .IgnoreCase = False .Pattern = strPattern End With If regEx.test(strInput) Then simpleCellRegex = regEx.Replace(strInput, strReplace) Else simpleCellRegex = "Not matched" End If End If End Function
Place your strings ("12abc") in cell A1
. Enter this formula =simpleCellRegex(A1)
in cell B1
and the result will be "abc".
Example 3: Loop Through Range
This example is the same as example 1 but loops through a range of cells.
Private Sub simpleRegex() Dim strPattern As String: strPattern = "^[0-9]{1,2}" Dim strReplace As String: strReplace = "" Dim regEx As New RegExp Dim strInput As String Dim Myrange As Range Set Myrange = ActiveSheet.Range("A1:A5") For Each cell In Myrange If strPattern <> "" Then strInput = cell.Value With regEx .Global = True .MultiLine = True .IgnoreCase = False .Pattern = strPattern End With If regEx.Test(strInput) Then MsgBox (regEx.Replace(strInput, strReplace)) Else MsgBox ("Not matched") End If End If Next End Sub
Example 4: Splitting apart different patterns
This example loops through a range (A1
, A2
& A3
) and looks for a string starting with three digits followed by a single alpha character and then 4 numeric digits. The output splits apart the pattern matches into adjacent cells by using the ()
. $1
represents the first pattern matched within the first set of ()
.
Private Sub splitUpRegexPattern() Dim regEx As New RegExp Dim strPattern As String Dim strInput As String Dim Myrange As Range Set Myrange = ActiveSheet.Range("A1:A3") For Each C In Myrange strPattern = "(^[0-9]{3})([a-zA-Z])([0-9]{4})" If strPattern <> "" Then strInput = C.Value With regEx .Global = True .MultiLine = True .IgnoreCase = False .Pattern = strPattern End With If regEx.test(strInput) Then C.Offset(0, 1) = regEx.Replace(strInput, "$1") C.Offset(0, 2) = regEx.Replace(strInput, "$2") C.Offset(0, 3) = regEx.Replace(strInput, "$3") Else C.Offset(0, 1) = "(Not matched)" End If End If Next End Sub
Results:
Additional Pattern Examples
String Regex Pattern Explanation a1aaa [a-zA-Z][0-9][a-zA-Z]{3} Single alpha, single digit, three alpha characters a1aaa [a-zA-Z]?[0-9][a-zA-Z]{3} May or may not have preceding alpha character a1aaa [a-zA-Z][0-9][a-zA-Z]{0,3} Single alpha, single digit, 0 to 3 alpha characters a1aaa [a-zA-Z][0-9][a-zA-Z]* Single alpha, single digit, followed by any number of alpha characters </i8> \<\/[a-zA-Z][0-9]\> Exact non-word character except any single alpha followed by any single digit
To make use of regular expressions directly in Excel formulas the following UDF (user defined function) can be of help. It more or less directly exposes regular expression functionality as an excel function.
It takes 2-3 parameters.
$0
, $1
, $2
, and so on. $0
is the entire match, $1
and up correspond to the respective match groups in the regular expression. Defaults to $0
.Extracting an email address:
=regex("Peter Gordon: [email protected], 47", "\w+@\w+\.\w+") =regex("Peter Gordon: [email protected], 47", "\w+@\w+\.\w+", "$0")
Results in: [email protected]
Extracting several substrings:
=regex("Peter Gordon: [email protected], 47", "^(.+): (.+), (\d+)$", "E-Mail: $2, Name: $1")
Results in: E-Mail: [email protected], Name: Peter Gordon
To take apart a combined string in a single cell into its components in multiple cells:
=regex("Peter Gordon: [email protected], 47", "^(.+): (.+), (\d+)$", "$" & 1) =regex("Peter Gordon: [email protected], 47", "^(.+): (.+), (\d+)$", "$" & 2)
Results in: Peter Gordon
[email protected]
...
To use this UDF do the following (roughly based on this Microsoft page. They have some good additional info there!):
ALT+F11
to open the Microsoft Visual Basic for Applications Editor.Click on Insert Module. If you give your module a different name make sure the Module does not have the same name as the UDF below (e.g. naming the Module Regex
and the function regex
causes #NAME! errors).
In the big text window in the middle insert the following:
Function regex(strInput As String, matchPattern As String, Optional ByVal outputPattern As String = "$0") As Variant Dim inputRegexObj As New VBScript_RegExp_55.RegExp, outputRegexObj As New VBScript_RegExp_55.RegExp, outReplaceRegexObj As New VBScript_RegExp_55.RegExp Dim inputMatches As Object, replaceMatches As Object, replaceMatch As Object Dim replaceNumber As Integer With inputRegexObj .Global = True .MultiLine = True .IgnoreCase = False .Pattern = matchPattern End With With outputRegexObj .Global = True .MultiLine = True .IgnoreCase = False .Pattern = "\$(\d+)" End With With outReplaceRegexObj .Global = True .MultiLine = True .IgnoreCase = False End With Set inputMatches = inputRegexObj.Execute(strInput) If inputMatches.Count = 0 Then regex = False Else Set replaceMatches = outputRegexObj.Execute(outputPattern) For Each replaceMatch In replaceMatches replaceNumber = replaceMatch.SubMatches(0) outReplaceRegexObj.Pattern = "\$" & replaceNumber If replaceNumber = 0 Then outputPattern = outReplaceRegexObj.Replace(outputPattern, inputMatches(0).Value) Else If replaceNumber > inputMatches(0).SubMatches.Count Then 'regex = "A to high $ tag found. Largest allowed is $" & inputMatches(0).SubMatches.Count & "." regex = CVErr(xlErrValue) Exit Function Else outputPattern = outReplaceRegexObj.Replace(outputPattern, inputMatches(0).SubMatches(replaceNumber - 1)) End If End If Next regex = outputPattern End If End Function
Save and close the Microsoft Visual Basic for Applications Editor window.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With