Non-greedy string regular expression matching

Tags:

I'm pretty sure I'm missing something obvious here, but I cannot make R to use non-greedy regular expressions:

> library(stringr)
> str_match('xxx aaaab yyy', "a.*?b")                                         
     [,1]   
[1,] "aaaab"

Base functions behave the same way:

> regexpr('a.*?b', 'xxx aaaab yyy')
[1] 5
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE

I would expect the match to be just ab as per 'greedy' comment in http://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html:

By default repetition is greedy, so the maximal possible number of repeats is used. This can be changed to ‘minimal’ by appending ? to the quantifier. (There are further quantifiers that allow approximate matching: see the TRE documentation.)

Could someone please explain me what's going on?

Update. What's crazy is that in some other cases non-greedy patterns behave as expected:

> str_match('xxx <a href="abc">link</a> yyy <h1>Header</h1>', '<a.*>')
     [,1]                                          
[1,] "<a href=\"abc\">link</a> yyy <h1>Header</h1>"
> str_match('xxx <a href="abc">link</a> yyy <h1>Header</h1>', '<a.*?>')
     [,1]              
[1,] "<a href=\"abc\">"

995

asked May 16 '13 01:05

Victor K.

2 Answers

Difficult concept so I'll try my best... Someone feel free to edit and explain better if it is a bit confusing.

Expressions that match your patterns are searched from left to right. Yes, all of the following strings aaaab, aaab, aab, and ab are matches to your pattern, but aaaab being the one that starts the most to the left is the one that is returned.

So here, your non-greedy pattern is not very useful. Maybe this other example will help you understand better when a non-greedy pattern kicks in:

str_match('xxx aaaab yyy', "a.*?y") 
#      [,1]     
# [1,] "aaaab y"

Here all of the strings aaaab y, aaaab yy, aaaab yyy matched the pattern and started at the same position, but the first one was returned because of the non-greedy pattern.

So what can you do to catch that last ab? Use this:

str_match('xxx aaaab yyy', ".*(a.*b)")
#      [,1]        [,2]
# [1,] "xxx aaaab" "ab"

How does it work? By adding a greedy pattern .* in the front, you are now forcing the process to put the last possible a into the captured group.

113

answered Oct 19 '22 21:10

flodel

The problem is matching the shortest window between two strings. @flodel correctly mentions that a regex engine is parsing the string from left to right, and thus all the matches are leftmost. Greediness and laziness only apply to the boundaries on the right: greedy quantifiers get the substrings up to the rightmost boundaries, and the lazy ones will match up to the first occurrence of the subpatterns to follow.

See the examples:

> library(stringr)
> str_extract('xxx aaaab yyy', "a[^ab]*b")
[1] "ab"
> str_extract('xxx aaa xxx aaa zzz', "xxx.*?zzz")
[1] "xxx aaa xxx aaa zzz"
> str_extract('xxx aaa xxx aaa zzz', "xxx(?:(?!xxx|zzz).)*zzz")
[1] "xxx aaa zzz"

The first and the third scenarios return the shortest window, the second one is an illustration of the current problem but with a multicharacter input.

Scenario 1. Boundaries are single characters

In case a and b are single characters, the shortest window is found by using a negated character class. a[^ab]*b will easily grab the substring from a till the next b with no as and bs in between.

Scenario 2. Boundaries are not single characters

You may use a tempered greedy token in these cases that can be further unrolled. The xxx(?:(?!xxx|zzz).)*zzz pattern matches xxx, then any 0+ chars other than a linebreak char that is not the starting char of a xxx or zzz char sequence (the (?!xxx|zzz) is a negative lookahead that fails the match if the substring immediately to the right matches the lookahead pattern), and then a zzz.

These matching scenarios can be easily used with base R regmatches (using a PCRE regex flavor that supports lookaheads):

> x <- 'xxx aaa xxx aaa zzz xxx bbb xxx ccc zzz'
> unlist(regmatches(x, gregexpr("xxx(?:(?!xxx|zzz).)*zzz", x, perl = TRUE)))
[1] "xxx aaa zzz" "xxx ccc zzz"

One note: when using a PCRE regex in base R, or the ICU regex in str_extract/str_match, the . does not match linebreak characters, to enable that behavior, you need to add (?s) at the pattern start (an inline DOTALL modifier).

answered Oct 19 '22 20:10

Wiktor Stribiżew

Related questions
                            
                                checked="checked" not working in chrome
                            
                                JavaFX 2 TableView : different cell factory depending on the data inside the cell
                            
                                Installing only one target (and its dependencies) out of a complex project with cmake (open to better solutions)
                            
                                Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1), thread 32140
                            
                                Use Angular Directive attributes in its template
                            
                                Lazy loading of class attributes
                            
                                Why the SQL Server ignore the empty space at the end automatically?
                            
                                How can I change log level of single logger in runtime? [closed]
                            
                                Jquery on image error not working on dynamic images?
                            
                                set property 'system.windows.resourcedictionary.deferrablecontent' threw an exception
                            
                                Pycharm environment different than command line
                            
                                Finding non-numeric data in a data frame or vector

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With