Suppose we have a text in which we want to match all strings between double quotes, where the quoted text may itself contain escaped double quotes. Example:
"He said \"Hello\" to me for the first time"
Using regexes, how do you match this efficiently?
A very efficient solution to match such inputs is to use the normal* (special normal*)* pattern; the name is borrowed from Jeffrey Friedl's excellent book, Mastering Regular Expressions.
It is a pattern useful in general to match inputs consisting of regular entries (the normal part) with separators in between (the special part).
Note that like all things regex, it should be used when there is no better choice; while one could use this pattern for parsing CSV data, for instance, if you use Java, you're better off using OpenCSV instead.
Also note that while the quantifiers in the pattern name are stars (i.e., zero or more), you can vary them to suit your needs.
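The general shape, including the freedom to vary the quantifiers, can be sketched as a small Python helper. The function name `unrolled` and its parameter names are made up for illustration; they do not come from Friedl's book or from any library.

```python
def unrolled(normal: str, special: str,
             first: str = "*", inner: str = "*", repeat: str = "*") -> str:
    """Build normal<first>(special normal<inner>)<repeat> as a regex string.

    Hypothetical helper: `normal` and `special` are assumed to be regex
    fragments that can be quantified as-is (a character class, an escaped
    character, and so on).
    """
    return f"{normal}{first}({special}{normal}{inner}){repeat}"


# Canonical form: every quantifier is a star.
print(unrolled("[a-z]", "-"))            # [a-z]*(-[a-z]*)*

# Same shape with different quantifiers.
print(unrolled("[a-z]", "-", "+", "+"))  # [a-z]+(-[a-z]+)*
```

Keeping the three quantifiers as separate parameters mirrors the point above: the shape stays fixed while each quantifier is chosen per problem.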
Let us take the quoted string example again, and please consider that this text sample may be anywhere in your input:
"He said \"Hello\" to me for the first time"
No matter how hard you try, no amount of "dot plus greedy/lazy quantifiers" magic will help you solve it. Instead, categorize the input between quotes as normal and special:
normal is any character other than a backslash or a double quote: [^\\"];
special is a backslash followed by a double quote: \\".
Substituting this into the normal* (special normal*)* pattern gives the following regex:
[^\\"]*(\\"[^\\"]*)*
Adding the double quotes around to match the full text gives the final regex:
"[^\\"]*(\\"[^\\"]*)*"
You will note that this will also match empty quoted strings.
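As a quick check, the final regex can be tried with Python's `re` module. This is only a sketch: the sample sentence around the quoted text and the variable names are invented for the demonstration.

```python
import re

# normal  = [^\\"]  (anything but a backslash or a double quote)
# special = \\"     (a backslash followed by a double quote)
QUOTED = re.compile(r'"[^\\"]*(\\"[^\\"]*)*"')

text = r'She wrote "He said \"Hello\" to me for the first time" and then ""'

# group(0) is the whole match; findall() would return the group's content instead.
matches = [m.group(0) for m in QUOTED.finditer(text)]
for match in matches:
    print(match)

# For comparison, a lazy dot stops at the first escaped quote.
naive = re.search(r'".*?"', text).group(0)
print(naive)
```

The first loop prints the full quoted sentence and then the empty quoted string, while the lazy-dot attempt is cut short at the first `\"`.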
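As a second example, consider matching words whose parts are separated by dashes. Here we will have to use a variant on the quantifiers, since a word cannot be empty and every dash must be followed by at least one more letter.
For simplicity, we will also suppose that only lowercase ASCII letters are allowed.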
Sample input:
the-word-to-match
Let us decompose again into normal and special:
normal is a lowercase ASCII letter: [a-z];
special is the dash: -.
The canonical form of the pattern would be:
[a-z]*(-[a-z]*)*
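But as we said:
the first normal must not be empty, so its * should become +;
a dash must be followed by at least one letter, so the * on the normal inside the group should become +.
We end up with: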
[a-z]+(-[a-z]+)*
Adding word anchors around it to obtain the final result:
\b[a-z]+(-[a-z]+)*\b
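The effect of swapping the quantifiers can be seen by comparing the canonical form with the variant in Python. The candidate strings and the helper name `accepted` are assumptions made for this sketch.

```python
import re

CANONICAL = re.compile(r'[a-z]*(-[a-z]*)*')  # every quantifier is a star
VARIANT = re.compile(r'[a-z]+(-[a-z]+)*')    # no empty words, no dangling dash


def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if pattern.fullmatch(c)]


candidates = ['the-word-to-match', 'word', '', '-', 'a--b', 'trailing-']
print(accepted(CANONICAL, candidates))  # all six candidates
print(accepted(VARIANT, candidates))    # ['the-word-to-match', 'word']

# With word anchors, the variant can be searched for inside a longer text.
WORD = re.compile(r'\b[a-z]+(-[a-z]+)*\b')
found = [m.group(0) for m in WORD.finditer('please find the-word-to-match here')]
print(found)  # ['please', 'find', 'the-word-to-match', 'here']
```

Note that plain words without any dash also match, since the group may repeat zero times.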
The examples above limit themselves to replacing * with +, but of course you can have as many variations as you wish. One ultra-classical example would be an IP address:
normal is up to three digits (\d{1,3});
special is the dot (\.);
the first normal appears only once, therefore no quantifier;
the normal inside the (special normal*) part also appears only once, therefore no quantifier;
the (special normal*) part appears exactly three times, therefore {3}.
Which gives the expression (decorated with word anchors):
\b\d{1,3}(\.\d{1,3}){3}\b
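A short Python sketch of this regex in use follows. The sample log line and the name `IPV4_SHAPE` are invented here; the name is a reminder that the regex only checks the dotted shape, not that each number is at most 255.

```python
import re

IPV4_SHAPE = re.compile(r'\b\d{1,3}(\.\d{1,3}){3}\b')

log_line = 'hosts 192.168.0.1 and 10.0.0.255 answered, 1.2.3 did not'

# group(0) is the whole address; the group itself only holds the last ".NNN".
found = [m.group(0) for m in IPV4_SHAPE.finditer(log_line)]
print(found)  # ['192.168.0.1', '10.0.0.255']

# Shape only: out-of-range numbers are still accepted.
print(bool(IPV4_SHAPE.fullmatch('999.999.999.999')))  # True
```

If range checking matters, one option is to match the shape with the regex and then validate each number in ordinary code.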
This pattern's flexibility makes it one of the most useful tools in your regex toolbox. Many problems are better solved with a dedicated library than with regexes, but in some situations you have to use regexes. And this will become one of your best friends once you have practiced with it a bit!
Note that the group in this pattern (the (special normal*) part) does not need to capture anything; it is therefore recommended that you use a non-capturing group. For instance, use "[^\\"]*(?:\\"[^\\"]*)*" for quoted strings. In fact, even if you wanted it, capturing would almost never lead to the desired results in this case, because a repeated capturing group only ever gives you the last capture (all previous repetitions are overwritten), unless you are using this pattern in .NET. (thanks @ohaal)
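The "last capture only" behaviour is easy to observe in Python's `re` module, which, like most engines, keeps only the final repetition of a group. The sample string and variable names below are assumptions made for this sketch.

```python
import re

text = r'"He said \"Hello\" to me"'

capturing = re.compile(r'"[^\\"]*(\\"[^\\"]*)*"')
non_capturing = re.compile(r'"[^\\"]*(?:\\"[^\\"]*)*"')

m = capturing.fullmatch(text)
# The group matched twice (\"Hello, then \" to me); only the last one is kept.
last_capture = m.group(1)
print(last_capture)

# The non-capturing version matches the same text and defines no groups.
print(non_capturing.groups)  # 0
same_match = non_capturing.fullmatch(text).group(0) == m.group(0)
print(same_match)  # True
```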