What does #(\w+)=([\'"])(.*)\\2#U mean?

Tags: regex

I am a noob to regex.

I am hoping someone can explain what's going on in #(\w+)=([\'"])(.*)\\2#U.

preg_match_all('#(\w+)=([\'"])(.*)\\2#U', $str, $matches);

Thanks in advance.

shin asked Dec 17 '22 19:12


2 Answers

Let's break it apart, piece by piece. To begin with, note that preg_match_all takes delimiters around its regex, so the #s don't match anything. The U after the closing delimiter is important, though: it's a modifier which makes the match "ungreedy". This means that rather than matching as much as possible, all of the quantifiers (?, *, +, {n,m}) will match as little as possible by default. Then, piece by piece:

  1. (\w+): The \w matches a "word character"—something alphanumeric or an underscore; the + matches one or more of these; and the parentheses group it and store it in the first capturing group, which can be accessed with \1.
  2. =: Match a literal =. Very simple :)
  3. ([\'"]): The square brackets introduce a character class, which is a shorthand way to say "match any of these characters". Here, the character class is ['"], but since the pattern is written as a single-quoted PHP string, the ' has to be escaped. Thus, this matches either a ' or a ", and stores the result in the second capturing group, which can be accessed with \2. This is the only capturing group that the regex itself refers back to (in step 5).
  4. (.*): The . matches any non-newline character, and * matches any number (zero or more) of them. This is why the U modifier is important! Without it, this would run on to the last matching quote on the line; with it, it will match only until the next thing matches. Note that, since it's in parentheses, it's in the third capturing group, which can be accessed with \3 (shocking).
  5. \\2: The backslash is escaped in the PHP string, so the regex engine sees just \2: a backreference to the contents of the second capturing group. In this case, it's whichever quote we matched back in step 3.
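
As a rough illustration of the five steps above, here is a minimal sketch that prints what each capturing group holds for one match. The input string and the variable names are made up for the example, and PREG_SET_ORDER is used only so that each match's groups sit together in one array.

```php
<?php
// Hypothetical input, chosen only for illustration.
$str = 'name="value"';

// PREG_SET_ORDER groups the results per match instead of per capturing group.
preg_match_all('#(\w+)=([\'"])(.*)\\2#U', $str, $matches, PREG_SET_ORDER);

$first = $matches[0];
echo $first[0], "\n"; // whole match:      name="value"
echo $first[1], "\n"; // group 1, (\w+):   name
echo $first[2], "\n"; // group 2, ([\'"]): "
echo $first[3], "\n"; // group 3, (.*):    value
```

Without the PREG_SET_ORDER flag, preg_match_all uses its default PREG_PATTERN_ORDER layout, where $matches[1] would hold every first group, $matches[2] every second group, and so on.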

Putting that all together, this regex matches, roughly speaking, a variable name (step 1) followed by an equals sign (step 2) followed by a quoted string (steps 3-5). The reason for the \2 is so that the closing quote has to be the same as the opening one, so the regex won't accept "string' as a complete string. The reason for the U modifier is so that foo="string" bar="strung" will return the two matches foo="string" and bar="strung" (with \1 being foo and bar, and \3 being string and strung), rather than the single, greedy match foo="string" bar="strung" (with \1 being foo and \3 being string" bar="strung). Some examples that match are

foo_bar_123="John's applesauce."
100='seventeen'
banana_split=""
_="This is a normal string"

These entities can be scattered throughout the string, on the same line or on different lines, within surrounding text or not, just as long as each entity is itself on one line. Note further that no spaces are allowed, so foo = "bar" won't match.
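
To check the greedy versus ungreedy behaviour and the examples above, something like the following sketch should do. The surrounding words (start, then, and, end) and the variable names are assumptions added for illustration; only the entities themselves come from the examples.

```php
<?php
// The pattern without delimiters, so it can be tried with and without U.
$pattern = '(\w+)=([\'"])(.*)\\2';

// Ungreedy (U) versus greedy on the two-attribute example.
$str = 'foo="string" bar="strung"';
preg_match_all('#' . $pattern . '#U', $str, $ungreedy);
preg_match_all('#' . $pattern . '#', $str, $greedy);

print_r($ungreedy[0]); // two matches: foo="string" and bar="strung"
print_r($greedy[3]);   // one match whose third group is: string" bar="strung

// The example entities, scattered across lines with surrounding text,
// plus one with spaces around "=" that should not match.
$text = <<<'TXT'
start foo_bar_123="John's applesauce." then 100='seventeen'
banana_split="" and _="This is a normal string" end
foo = "bar"
TXT;

$count = preg_match_all('#' . $pattern . '#U', $text, $found);
echo $count, "\n";  // 4
print_r($found[1]); // foo_bar_123, 100, banana_split, _
```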

Antal Spector-Zabusky answered Jan 04 '23 23:01


You're matching strings of the form:

foo='bar'

or

baz="blat"

(\w+) matches one or more word characters and captures them as the first group. (Word characters are a to z, A to Z, 0 to 9, and underscore.)

= matches a literal equal sign.

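([\'"]) matches a single or double quote and captures it as the second group.

(.*) matches any sequence of zero or more characters other than newlines and captures it as the third group. Because of the U modifier, it matches as few characters as possible.

\\2 is an escaped \2, which in a regex is a backreference: it matches whatever the second group captured. In this case the second group is either a single or double quote. Using \2 ensures that the closing quote is the same as the opening one, so you can use the other style of quote inside the string.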
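
A small sketch of the backreference behaviour described above, using made-up inputs (the strings and variable names are assumptions for illustration):

```php
<?php
$pattern = '#(\w+)=([\'"])(.*)\\2#U';

// The closing quote differs from the opening quote: \2 cannot match, so no result.
var_dump(preg_match($pattern, 'foo="bar\'')); // int(0)

// The other quote style inside the value is fine; only \2 ends the value.
preg_match($pattern, 'msg="it\'s here"', $m);
echo $m[3], "\n"; // it's here
```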

James Kovacs answered Jan 05 '23 00:01