Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why doesn't a non-greedy quantifier sometimes work in Oracle regex?

IMO, this query should return A=1,B=2,

SELECT regexp_substr('A=1,B=2,C=3,', '.*B=.*?,') as A_and_B FROM dual 

But it returns the whole string, A=1,B=2,C=3,, instead. Why?


Update 1:

Oracle 10.2+ is required to use Perl-style metacharacters in regular expressions.

Update 2:

A more clear form of my question (to avoid questions about Oracle version and availability of Perl-style regex extension):

On the same system, why does a non-greedy quantifier sometimes work as expected and sometimes not?

This works correctly:

regexp_substr('A=1,B=2,C=3,', 'B=.*?,') 

This doesn't work:

regexp_substr('A=1,B=2,C=3,', '.*B=.*?,') 

Fiddle

Update 3:

Yes, it seems to be a bug.

What is the Oracle support reaction on this issue?

Is the bug already known? Does it have an ID?

like image 867
Egor Skriptunoff Avatar asked May 22 '13 22:05

Egor Skriptunoff


2 Answers

It's a BUG!

You are right that in Perl, 'A=1,B=2,C=3,' =~ /.*B=.*?,/; print $& prints A=1,B=2,

What you have stumbled upon is a bug that still exists in Oracle Database 11g R2. If the exact same regular expression atom (including the quantifier but excluding the greediness modifier) appears twice in a regular expression, both occurrences will have the greediness indicated by the first appearance regardless of the greediness specified by the second one. That this is a bug is clearly demonstrated by these results (here, "the exact same regular expression atom" is [^B]*):

SQL> SELECT regexp_substr('A=1,B=2,C=3,', '[^B]*B=[^Bx]*?,') as good FROM dual;  GOOD -------- A=1,B=2,  SQL> SELECT regexp_substr('A=1,B=2,C=3,', '[^B]*B=[^B]*?,') as bad FROM dual;  BAD ----------- A=1,B=2,C=3, 

The only difference between the two regular expressions is that the "good" one excludes 'x' as a possible match in the second matching list. Since 'x' does not appear in the target string, excluding it should make no difference, but as you can see, removing the 'x' makes a big difference. That has to be a bug.

Here are some more examples from Oracle 11.2: (SQL Fiddle with even more examples)

SELECT regexp_substr('A=1,B=2,C=3,', '.*B=.*?,')  FROM dual; =>  A=1,B=2,C=3, SELECT regexp_substr('A=1,B=2,C=3,', '.*B=.*,')   FROM dual; =>  A=1,B=2,C=3, SELECT regexp_substr('A=1,B=2,C=3,', '.*?B=.*?,') FROM dual; =>  A=1,B=2, SELECT regexp_substr('A=1,B=2,C=3,', '.*?B=.*,')  FROM dual; =>  A=1,B=2, -- Changing second operator from * to + SELECT regexp_substr('A=1,B=2,C=3,', '.*B=.+?,')  FROM dual; =>  A=1,B=2, SELECT regexp_substr('A=1,B=2,C=3,', '.*B=.+,')   FROM dual; =>  A=1,B=2,C=3, SELECT regexp_substr('A=1,B=2,C=3,', '.+B=.+,')   FROM dual; =>  A=1,B=2,C=3, SELECT regexp_substr('A=1,B=2,C=3,', '.+?B=.+,')  FROM dual; =>  A=1,B=2, 

The pattern is consistent: the greediness of the first occurrence is used for the second occurrence whether it should be or not.

like image 143
Old Pro Avatar answered Oct 23 '22 10:10

Old Pro


Looking at the feedback, I hesitate to jump in, but here I go ;-)

According to the Oracle docs, the *? and +? match a "preceding subexpression". For *? specifically:

Matches zero or more occurrences of the preceding subexpression (nongreedyFootref 1). Matches the empty string whenever possible.

To create a subexpression group, use parenthesis ():

Treats the expression within the parentheses as a unit. The expression can be a string or a complex expression containing operators.

You can refer to a subexpression in a back reference.

This will allow you to use greedy and non-greedy (many alternating times actually) in the same regexp, with expected results. For your example:

select regexp_substr('A=1,B=2,C=3,', '(.)*B=(.)*?,') from dual; 

To make the point a bit more clear (i hope), this example uses greedy and non-greedy in the same regexp_substr, with different (correct) results depending on where the ? is placed (it does NOT just use the rule for the first subexpression it sees). Also note that the subexpression (\w) will match alphanumerics and underscore only, not @.

-- non-greedy followed by greedy  select regexp_substr('1_@_2_a_3_@_4_a', '(\w)*?@(\w)*') from dual; 

result: 1_@_2_a_3_

-- greedy followed by non-greedy select regexp_substr('1_@_2_a_3_@_4_a', '(\w)*@(\w)*?') from dual; 

result: 1_@

like image 20
tbone Avatar answered Oct 23 '22 08:10

tbone