
Regex that I don't understand

Tags: java, regex

I'm trying to understand this regex, can you help me out?

(?s)\\{\\{wotd\\|(.+?)\\|(.+?)\\|([^#\\|]+).*?\\}\\}
  • I don't really understand the meaning of DOTALL: (?s)
  • Why the double \\ before }?
  • What exactly does (.+?) mean? Is it read as the ., then + acting on the ., then ? acting on the result of the .+ part?
Paul asked Jan 08 '12 15:01



1 Answer

This regex is written as a Java string literal, in which every backslash has to be doubled. The "canonical" regex, as the regex engine sees it, is:

(?s)\{\{wotd\|(.+?)\|(.+?)\|([^#\|]+).*?\}\}
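
To see the relationship between the two forms, here is a minimal sketch (the class and method names are made up for illustration). It compiles the string literal from the question and asks the compiled pattern for the text the engine actually received:

```java
import java.util.regex.Pattern;

class EscapingDemo {
    // The pattern as typed in Java source: each regex backslash is doubled,
    // because the Java compiler consumes one level of escaping first.
    static final String SOURCE_FORM =
        "(?s)\\{\\{wotd\\|(.+?)\\|(.+?)\\|([^#\\|]+).*?\\}\\}";

    // pattern() returns the string that was handed to the regex engine,
    // i.e. the "canonical" form with single backslashes.
    static String canonical() {
        return Pattern.compile(SOURCE_FORM).pattern();
    }

    public static void main(String[] args) {
        System.out.println(canonical());
    }
}
```

Running main prints the single-backslash form shown above.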

The DOTALL modifier, (?s), means that the dot can also match a newline character. Complemented character classes can do that without it, at least in Java: i.e. [^a] will match each and every character which is not a, newline included. Some regex engines do NOT match a newline in complemented character classes though (this can be regarded as a bug).
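
A small sketch of that behaviour in Java (class and method names are illustrative only): the dot refuses a newline by default, accepts it under (?s) or Pattern.DOTALL, and a complemented class accepts it either way.

```java
import java.util.regex.Pattern;

class DotallDemo {
    // Without DOTALL, "." does not match the line terminator "\n".
    static boolean dotByDefault() {
        return Pattern.matches("a.b", "a\nb");
    }

    // The inline flag (?s) turns DOTALL on for the whole pattern.
    static boolean dotWithInlineFlag() {
        return Pattern.matches("(?s)a.b", "a\nb");
    }

    // Same effect using the compile-time flag instead of (?s).
    static boolean dotWithCompileFlag() {
        return Pattern.compile("a.b", Pattern.DOTALL).matcher("a\nb").matches();
    }

    // A complemented class matches the newline with no flag at all.
    static boolean negatedClass() {
        return Pattern.matches("a[^x]b", "a\nb");
    }

    public static void main(String[] args) {
        System.out.println(dotByDefault());       // false
        System.out.println(dotWithInlineFlag());  // true
        System.out.println(dotWithCompileFlag()); // true
        System.out.println(negatedClass());       // true
    }
}
```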

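The +? and *? are lazy quantifiers (which should generally be avoided). So (.+?) does read the way you describe: the dot, the + quantifier applied to the dot, and a trailing ? which makes that quantifier lazy. Lazy means it takes as few characters as possible: before each character it wants to swallow, the engine first checks whether the next component of the regex can match at the current position, and only swallows the character if it cannot.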
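
As a rough illustration, the sketch below runs a greedy, a lazy and a negated-class version of the same capture against the made-up input "a|b|c" (helper and class names are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyDemo {
    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "a|b|c";
        // Greedy: grabs everything, then backtracks to the LAST pipe.
        System.out.println(firstGroup("(.+)\\|", input));    // a|b
        // Lazy: grows one character at a time, stops at the FIRST pipe.
        System.out.println(firstGroup("(.+?)\\|", input));   // a
        // Negated class: same result as lazy, with no lookahead-style probing.
        System.out.println(firstGroup("([^|]+)\\|", input)); // a
    }
}
```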

{ and } are preceded by \ because {...} is the syntax of the repetition quantifier {n,m}, where n and m are integers; the backslash makes the braces literal.
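
A quick sketch of the difference, using made-up inputs and illustrative names. Pattern.quote is shown only as an alternative way to get literal braces; it is not what the regex in the question uses.

```java
import java.util.regex.Pattern;

class BraceDemo {
    // Unescaped: x{3} is "x repeated exactly three times".
    static boolean asQuantifier(String input) {
        return Pattern.matches("x{3}", input);
    }

    // Escaped: x\{3\} is the literal five-character text "x{3}".
    static boolean asLiteral(String input) {
        return Pattern.matches("x\\{3\\}", input);
    }

    // Pattern.quote wraps the text so every character is taken literally.
    static boolean quoted(String input) {
        return Pattern.matches(Pattern.quote("{{wotd|"), input);
    }

    public static void main(String[] args) {
        System.out.println(asQuantifier("xxx"));  // true
        System.out.println(asQuantifier("x{3}")); // false
        System.out.println(asLiteral("x{3}"));    // true
        System.out.println(quoted("{{wotd|"));    // true
    }
}
```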

Also, there is no need to escape the pipe | inside the character class [^#\|]; it can be written simply as [^#|].
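
If you want to convince yourself, a sketch like this (names are illustrative) shows both spellings of the class accepting and rejecting the same inputs, while a pipe outside a class still means alternation:

```java
import java.util.regex.Pattern;

class PipeDemo {
    // Character class with the pipe escaped, as in the question.
    static boolean escaped(String input) {
        return Pattern.matches("[^#\\|]+", input);
    }

    // Same class with the pipe left bare.
    static boolean bare(String input) {
        return Pattern.matches("[^#|]+", input);
    }

    public static void main(String[] args) {
        for (String s : new String[] { "abc", "a|c", "a#c" }) {
            // Both columns agree for every input: true, false, false.
            System.out.println(s + " " + escaped(s) + " " + bare(s));
        }
        // Outside a class the bare pipe is alternation: "a" or "b".
        System.out.println(Pattern.matches("a|b", "b")); // true
    }
}
```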

And finally, the .*? at the end seems to be there to swallow the rest of the fields, up to the closing }}. A better alternative is to use the normal* (special normal*)* pattern, where normal is [^|}] and special is \|.

Here is the regex with the lazy quantifiers replaced, the character class "fixed" and the end modified. Note that the DOTALL modifier has disappeared as well, since the dot isn't used anymore:

\{\{wotd\|([^|]+)\|([^|]+)\|([^#|]+)[^|}]*(?:\|[^|}]*)*\}\}

Step by step:

\{\{         # literal "{{", followed by
wotd         # literal "wotd", followed by
\|           # literal "|", followed by
([^|]+)      # one or more characters which are not a "|" (captured), followed by
\|           # literal "|", followed by
([^|]+)      # one or more characters which are not a "|" (captured), followed by
\|           # literal "|", followed by
([^#|]+)     # one or more characters which are not "|" or "#" (captured), followed by
[^|}]*       # zero or more characters which are not "|" or "}", followed by
(?:          # begin group
  \|         # a literal "|", followed by
  [^|}]*     # zero or more characters which are not "|" or "}"
)            # end group
*            # zero or more times, followed by
\}\}         # literal "}}"
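
For completeness, a minimal sketch of using the rewritten regex from Java. The class name, the helper and the sample inputs are all made up for illustration; only the pattern itself comes from above, with backslashes doubled for the string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WotdDemo {
    // The rewritten regex, written as a Java string literal.
    static final Pattern WOTD = Pattern.compile(
        "\\{\\{wotd\\|([^|]+)\\|([^|]+)\\|([^#|]+)[^|}]*(?:\\|[^|}]*)*\\}\\}");

    // Returns the three captured fields of the first match, or null if none.
    static String[] fields(String text) {
        Matcher m = WOTD.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Hypothetical input: three fields, a "#..." suffix, then extra fields.
        String sample = "{{wotd|apple|n|a round fruit#anchor|January|8}}";
        for (String field : fields(sample)) {
            System.out.println(field);
        }
    }
}
```

With that sample, the three groups come out as apple, n and a round fruit: the third group stops at the #, and everything from there to the closing }} is skipped by the normal* (special normal*)* tail.
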
fge answered Sep 28 '22 23:09
