Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Possessive generic quantifier {m,n}+ not implemented in Ruby 1.9.3?

Possessive quantifiers are greedy and refuse backtrack. A regex /.{1,3}+b/ should mean: Match any character except line breaks, 1 to 3 times, as many as possible and don't backtrack. Tthen match the character b.

In this example:

'ab'.sub /.{1,3}+b/, 'c'    #=> "c"

no substitution should take place, contrary to fact.

The result in these two examples differs:

'aab'.sub /.{0,1}+b/, 'c'   #=> "c"
'aab'.sub /.?+b/, 'c'       #=> "ac"

Compare this with Scala, where they give the same answer:

scala> ".{0,1}+b".r.replaceAllIn("aab", "c")
res1: String = ac
scala> ".?+b".r.replaceAllIn("aab", "c")
res2: String = ac

Is this a Ruby bug, or is it possible to motivate this behavior? Perhaps, Oniguruma for some reason implemented possessive with all quantifiers ?, *, + except the generic quantifier {m,n}? If that's the case, why?

like image 457
Staffan Nöteberg Avatar asked Mar 20 '13 09:03

Staffan Nöteberg


2 Answers

What really happens

It seems that + followed by range quantifier doesn't offer the possessive property to the range quantifier. Rather, it is treated as whatever in front repeated once or more. Using .{1,3}+b as an example, it will be equivalent to (?:.{1,3})+b.

Work-around

You can work-around this with the more general construct non-backtracking group (or atomic grouping) (?>pattern). Let us use the general case pattern{n,m}+ as an example to construct the equivalent regex with non-backtracking group (equivalent to Java's behavior when matching with pattern{n,m}+):

(?>(?>pattern){n,m})

Why 2 levels of non-backtracking groups? 2 are necessary because:

  • When a match is found for pattern (one instance of repetition), backtracking within pattern is disallowed. (Note that as long as an instance has not been found, backtracking within pattern is allowed). This is emulated with the inner non-backtrakcing group.
  • When no more instances of pattern can be found, backtracking to remove any of the instances is disallowed. This is emulated with the outer non-backtracking group.

I am not sure if there is any caveat here. Please ping me with comment if you found any case not emulated with this method.

Testing

Test 1

At first, I tested this regex:

(.{1,3}+)b

Initially, I tested without the capturing group, but the result was so surprising that I need the capturing group to confirm what is happening.

On this input:

2343333ab

The result is that the whole string matches, and the capturing group captured 2343333a (without the b at the end). This shows that the upper limit has somehow been broken.

DEMO at rubular

Test 2

This second test reveals how range quantifiers {n}'s behavior cannot be modified to be possessive, and it is likely that this also applies for other range quantifiers {n,} and {n,m}. Instead, the following + will only exhibit repetition of 1 or more time behavior.

(My initial conclusion is that + overwrites the upper limit, but it turns out to be wrong).

Testing regex:

(.{3}+)b

Input string:

23d4344333ab
234344333ab
23434433ab

The matches captured in capturing group 1 are all multiples of 3. From top to bottom, the regex skips 2, 1, 0 characters respectively for the input strings.

Input string with annotation ([] denotes the match for the whole regex, () denotes the text captured by capturing group 1):

23[(d4344333a)b]
2[(34344333a)b]
[(23434433a)b]

DEMO at rubular

Testing code for work around

This is the testing code in Java to show that both outer and inner non-backtracking groups are necessary. ideone

class TestPossessive {
  public static void main(String args[]) {
    String inputText = "123456789012";
    System.out.println("Input string: " + inputText);
    System.out.println("Expected: " + inputText.replaceFirst("(?:\\d{3,4}(?![89])){2,}+", ">$0<"));
    System.out.println("Outer possessive group: " + inputText.replaceFirst("(?>(?:\\d{3,4}(?![89])){2,})", ">$0<"));
    System.out.println("Inner possessive group: " + inputText.replaceFirst("(?>\\d{3,4}(?![89])){2,}", ">$0<"));
    System.out.println("Both: " + inputText.replaceFirst("(?>(?>\\d{3,4}(?![89])){2,})", ">$0<"));

    System.out.println();

    inputText = "aab";
    System.out.println("Input string: " + inputText);
    System.out.println("Expected: " + inputText.replaceFirst(".{1,3}+b", ">$0<"));
    System.out.println("Outer possessive group: " + inputText.replaceFirst("(?>.{1,3})b", ">$0<"));
    System.out.println("Inner possessive group: " + inputText.replaceFirst("(?>.){1,3}b", ">$0<"));
    System.out.println("Both: " + inputText.replaceFirst("(?>(?>.){1,3})b", ">$0<"));
  }
}
like image 93
nhahtdh Avatar answered Oct 31 '22 20:10

nhahtdh


It seems like this is intended in Oniguruma. Documentation says {n,m}+, {n,}+, {n}+ are possessive op. in ONIG_SYNTAX_JAVA only. I guess this is because of backward compatibility reasons, or?

like image 33
Staffan Nöteberg Avatar answered Oct 31 '22 20:10

Staffan Nöteberg