I'm pretty sure I'm missing something obvious here, but I cannot make R to use non-greedy regular expressions:
> library(stringr)
> str_match('xxx aaaab yyy', "a.*?b")
[,1]
[1,] "aaaab"
Base functions behave the same way:
> regexpr('a.*?b', 'xxx aaaab yyy')
[1] 5
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE
I would expect the match to be just ab
as per 'greedy' comment in http://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html:
By default repetition is greedy, so the maximal possible number of repeats is used. This can be changed to ‘minimal’ by appending ? to the quantifier. (There are further quantifiers that allow approximate matching: see the TRE documentation.)
Could someone please explain me what's going on?
Update. What's crazy is that in some other cases non-greedy patterns behave as expected:
> str_match('xxx <a href="abc">link</a> yyy <h1>Header</h1>', '<a.*>')
[,1]
[1,] "<a href=\"abc\">link</a> yyy <h1>Header</h1>"
> str_match('xxx <a href="abc">link</a> yyy <h1>Header</h1>', '<a.*?>')
[,1]
[1,] "<a href=\"abc\">"
It means the greedy quantifiers will match their preceding elements as much as possible to return to the biggest match possible. On the other hand, the non-greedy quantifiers will match as little as possible to return the smallest match possible. non-greedy quantifiers are the opposite of greedy ones.
To make the quantifier non-greedy you simply follow it with a '?' the first 3 characters and then the following 'ab' is matched. greedy by appending a '?' symbol to them: *?, +?, ??, {n,m}?, and {n,}?.
The standard quantifiers in regular expressions are greedy, meaning they match as much as they can, only giving back as necessary to match the remainder of the regex. By using a lazy quantifier, the expression tries the minimal match first.
To use non-greedy Perl-style regular expressions, the ? (question mark) may be added to the syntax, usually where the wildcard expression is used. In our above example, our wildcard character is the . * (period and asterisk). The period will match any character except a null (hex 00) or new line.
Difficult concept so I'll try my best... Someone feel free to edit and explain better if it is a bit confusing.
Expressions that match your patterns are searched from left to right. Yes, all of the following strings aaaab
, aaab
, aab
, and ab
are matches to your pattern, but aaaab
being the one that starts the most to the left is the one that is returned.
So here, your non-greedy pattern is not very useful. Maybe this other example will help you understand better when a non-greedy pattern kicks in:
str_match('xxx aaaab yyy', "a.*?y")
# [,1]
# [1,] "aaaab y"
Here all of the strings aaaab y
, aaaab yy
, aaaab yyy
matched the pattern and started at the same position, but the first one was returned because of the non-greedy pattern.
So what can you do to catch that last ab
? Use this:
str_match('xxx aaaab yyy', ".*(a.*b)")
# [,1] [,2]
# [1,] "xxx aaaab" "ab"
How does it work? By adding a greedy pattern .*
in the front, you are now forcing the process to put the last possible a
into the captured group.
The problem is matching the shortest window between two strings. @flodel correctly mentions that a regex engine is parsing the string from left to right, and thus all the matches are leftmost. Greediness and laziness only apply to the boundaries on the right: greedy quantifiers get the substrings up to the rightmost boundaries, and the lazy ones will match up to the first occurrence of the subpatterns to follow.
See the examples:
> library(stringr)
> str_extract('xxx aaaab yyy', "a[^ab]*b")
[1] "ab"
> str_extract('xxx aaa xxx aaa zzz', "xxx.*?zzz")
[1] "xxx aaa xxx aaa zzz"
> str_extract('xxx aaa xxx aaa zzz', "xxx(?:(?!xxx|zzz).)*zzz")
[1] "xxx aaa zzz"
The first and the third scenarios return the shortest window, the second one is an illustration of the current problem but with a multicharacter input.
Scenario 1. Boundaries are single characters
In case a
and b
are single characters, the shortest window is found by using a negated character class. a[^ab]*b
will easily grab the substring from a
till the next b
with no a
s and b
s in between.
Scenario 2. Boundaries are not single characters
You may use a tempered greedy token in these cases that can be further unrolled. The xxx(?:(?!xxx|zzz).)*zzz
pattern matches xxx
, then any 0+ chars other than a linebreak char that is not the starting char of a xxx
or zzz
char sequence (the (?!xxx|zzz)
is a negative lookahead that fails the match if the substring immediately to the right matches the lookahead pattern), and then a zzz
.
These matching scenarios can be easily used with base R regmatches
(using a PCRE regex flavor that supports lookaheads):
> x <- 'xxx aaa xxx aaa zzz xxx bbb xxx ccc zzz'
> unlist(regmatches(x, gregexpr("xxx(?:(?!xxx|zzz).)*zzz", x, perl = TRUE)))
[1] "xxx aaa zzz" "xxx ccc zzz"
One note: when using a PCRE regex in base R, or the ICU regex in str_extract
/str_match
, the .
does not match linebreak characters, to enable that behavior, you need to add (?s)
at the pattern start (an inline DOTALL modifier).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With