
Regex, group & quantifier

I just did the funny regex crosswords at http://regexcrossword.com/ and found out that I don't understand what quantifying a group means, e.g. (.)+ or (.)*

I tried it at http://ole.michelsen.dk/tools/regex.html, which offers both the JavaScript and the PHP regex engine:

The string to match against is "Trololo!" (without quotation marks). Where switching on "Global match" changed the result, that result is listed as a primed variant (JS'). There is no primed PHP variant because the option changed nothing in PHP mode.

JS,  (.)+ => 0: Trololo! 1: ! 
JS', (.)+ => 0: Trololo! 
PHP, (.)+ => 0: Trololo! 0: ! 
JS,  (.)* => 0: Trololo! 1: ! 
JS', (.)* => 0: Trololo! 
PHP, (.)* => 0: Trololo! 1: 0: ! 1: 
JS,  (.){5} => 0: Trolo 1: o 
JS', (.){5} => 0: Trolo 
PHP, (.){5} => 0: Trolo 0: o 
JS,  (.){4} => 0: Trol 1: l 
JS', (.){4} => 0: Trol 1: olo! 
PHP, (.){4} => 0: Trol 1: olo! 0: l 1: ! 

Is there a normative answer as to what the semantics of this are?

Falko asked Oct 21 '22 04:10



1 Answer

The outputs aren't labelled correctly, that's all.

First of all, what should happen? If you repeat a group, each new iteration overwrites the previous capture, so only the last one is kept. If the group does not take part in the match at all, it yields an empty string or something like undefined in JS (it depends on the flavor). There is a good article over on regular-expressions.info on the matter.
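As a minimal sketch of that behaviour in JavaScript (the variable names and the second pattern, (a)|b, are only illustrative and not from the question):

```javascript
// A repeated group keeps only the capture from its last iteration.
const repeated = /(.)+/.exec("Trololo!");
console.log(repeated[0]); // "Trololo!"
console.log(repeated[1]); // "!"

// A group that never takes part in the match is undefined in JavaScript.
const unused = /(a)|b/.exec("b");
console.log(unused[0]); // "b"
console.log(unused[1]); // undefined
```

Other flavors may report the unused group as an empty string instead.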

Now how do we get to your results? Let's start with JavaScript.

All the examples labelled JS (the non-global ones) fit the above description. They match the desired number of characters in 0 and capture the last character in 1. So we can ignore these.

What's with the global ones? Here the output was interpreted incorrectly. When you use the global flag with the String.match() function, you don't get an array of all captures any more - but only an array of all matches (group 0 for each match). Hence, in the case of +, * and {5} where there is only one match, you only get that one result. With {4} there is enough room for two matches in the target string, so the resulting array contains two elements. To get all captures with the global flag, you'd need to write a loop and use RegExp.exec() instead (which gives you one match at a time, but all its captures).
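A small sketch of the difference, using the {4} case from the question (the variable names are made up for illustration):

```javascript
const subject = "Trololo!";

// With the g flag, String.prototype.match returns only the full matches.
const matchesOnly = subject.match(/(.){4}/g);
console.log(matchesOnly); // [ 'Trol', 'olo!' ]

// RegExp.prototype.exec returns one match per call, including its captures.
// The regex object remembers its position in lastIndex between calls.
const pattern = /(.){4}/g;
const withCaptures = [];
let m;
while ((m = pattern.exec(subject)) !== null) {
  withCaptures.push({ match: m[0], group1: m[1] });
}
console.log(withCaptures);
// [ { match: 'Trol', group1: 'l' }, { match: 'olo!', group1: '!' } ]
```

This loop is safe here because every match consumes four characters. A pattern that can match the empty string would need lastIndex advanced by hand to avoid looping forever.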

And what's with PHP? It seems that it's using preg_match_all, which is global anyway, which is why using g had no effect. The + gives the result you'd expect again. So does {5}.

What's with the other two? Here, the output has been interpreted the wrong way round. By default, preg_match_all gives a two-dimensional array, where the first index corresponds to the group, and the second one corresponds to the match. In your output, it's interpreted the other way round. Hence, when there are two matches, the first pair labelled 0 and 1 holds the entire match (group 0) of each of the two matches. The second pair labelled 0 and 1 holds what group 1 captured in those two matches.
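To make the two layouts concrete, here is a rough JavaScript imitation of them for the {4} case. It is only an illustration of the array shapes, not PHP's actual implementation, and perMatch and perGroup are names invented for this sketch:

```javascript
const source = "Trololo!";

// Match-first layout: one entry per match, holding [group 0, group 1].
const perMatch = [...source.matchAll(/(.){4}/g)].map((m) => [m[0], m[1]]);
console.log(perMatch); // [ [ 'Trol', 'l' ], [ 'olo!', '!' ] ]

// Group-first layout, like the default of preg_match_all:
// one entry per group, holding that group's value for every match.
const perGroup = [0, 1].map((g) => perMatch.map((entry) => entry[g]));
console.log(perGroup); // [ [ 'Trol', 'olo!' ], [ 'l', '!' ] ]
```

Reading perGroup as if it were perMatch produces exactly the confusing labels from the question: "0: Trol 1: olo!" followed by "0: l 1: !".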

So for *, you first get the full string as a match, and the last character as the capture (the two things labelled 0), which is correct. And then, since * allows zero-width matches, you get another (empty) match at the end of the string, along with an empty capture. I'm not sure why the corresponding JS' example does not contain an additional empty string, though, because String.match would do the same thing.
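For reference, a quick check of what the JavaScript engine itself returns here (a sketch with illustrative variable names; String.prototype.matchAll needs an ES2020 engine):

```javascript
const input = "Trololo!";

// Global match: the full string, then a zero-width match at the end.
const all = input.match(/(.)*/g);
console.log(all); // [ 'Trololo!', '' ]

// matchAll also exposes the captures of each match.
const details = [...input.matchAll(/(.)*/g)].map((m) => [m[0], m[1]]);
console.log(details); // [ [ 'Trololo!', '!' ], [ '', undefined ] ]
```

Note that for the empty match JavaScript reports the capture as undefined, where PHP reports an empty string.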

And for {4}, you just get two matches (Trol and olo!) as in the JavaScript case with the captures l and !, respectively, which is again perfectly fine.

Martin Ender answered Nov 01 '22 02:11
