Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex only capturing last instance of capture group in match

I have the following regular expression in two different languages that produces the same odd results (javaScript and Flash). What I want to know is not how to fix it, but why the behavior is occurring?

The Regular Expression:

\[(\\{2}|\\\]|[^\]])*\]

The goal here is to match a bracketed string, and ensure that I don't stop at an escaped bracket.

If I have the text input [abcdefg] it is correctly matched, but the only thing returned as part of the capture group is g, where as I expect abcdefg. If I change the expression to \[((?:\\{2}|\\\]|[^\]])*)\], then I get the result that I want.

So why is this happening? Will this be consistent across other languages?

note: Simplifing the expression to \[([^\]])*\] produces the same issue.

like image 407
Daniel Gimenez Avatar asked Jun 30 '13 18:06

Daniel Gimenez


2 Answers

Regardless of the problem, ActionScript and JavaScript should always yield the same results, as they both implement ECMAScript (or a superset thereof, but for regular expressions they should not disagree).

But yes, this will be happening in any language (or rather any regex flavor). The reason is that you are repeating the capturing group. Let's take a simpler example: match (.)* against abc. So what we are repeating is (.). The first time it is tried, the engine enters the group, matches a with ., leaves the group and captures a. Only now does the quantifier kick in and it repeats the whole thing. So we enter the group again, and match and capture b. This capture overwrites the previous one, hence \1 does now contain b. Same again for the third repetition: the capture will be overwritten with with c.

I don't know of a regex flavor that behaves differently, and the only one that lets you access all previous captures (instead of just overwriting them) is .NET.

The solution is the one p.s.w.g. proposed. Make the grouping you need for the repetition non-capturing (this will improve performance, because you don't need all that capturing and overwriting anyway) and wrap the whole thing in a new group. Your expression has one little flaw though: you need to include include the backslash in the negated character class. Otherwise, backtracking could give you a match in [abc\]. So here is an expression that will work as you expect:

\[((?:\\{2}|\\\]|[^\]\\])*)\]

Working demo. (unfortunately, it doesn't show the captures, but it shows that it gives correct matches in all cases)

Note that your expression does not allow for other escape sequences. In particular a single \, followed by anything but a ] will cause your pattern to fail. If this is not what you desire, you can just use:

\[((?:\\.|[^\]\\])*)\]

Working demo.

Performance can further be improved with the "unrolling-the-loop" technique:

\[([^\]\\]*(?:\\.[^\]\\]*)*)\]

Working demo.

like image 104
Martin Ender Avatar answered Nov 10 '22 00:11

Martin Ender


Try including the * quantifier inside the capture group, like this:

\[((?:\\{2}|\\\]|[^\]])*)\]
like image 39
p.s.w.g Avatar answered Nov 10 '22 01:11

p.s.w.g