I am using a regex to parse some BBCode, so the regex has to work recursively to also match tags inside others. Most of the BBCode has an argument, and sometimes it's quoted, though not always.
A simplified equivalent of the regex I'm using (with html style tags to reduce the escaping needed) is this:
'~<(\")?a(?(1)\1)> #Match the tag, and require a closing quote if an opening one provided
([^<]+ | (?R))* #Match the contents of the tag, including recursively
</a>~x'
However, if I have a test string that looks like this:
<"a">Content<a>Also Content</a></a>
it only matches the <a>Also Content</a> because, when it tries to match from the first tag, the first capturing group, \1, is set to ", and this is not overwritten when the regex recurses to match the inner tag. Since the inner tag isn't quoted, it doesn't match, and the whole attempt from the first tag fails.
If instead I consistently either use or don't use quotes, it works fine, but I can't be sure that that will be the case with the content that I have to parse. Is there any way to work around this?
The full regex that I'm using, to match [spoiler]content[/spoiler], [spoiler=option]content[/spoiler] and [spoiler="option"]content[/spoiler], is
"~\[spoiler\s*+ #Match the opening tag
(?:=\s*+(\"|\')?((?(1)(?!\\1).|[^\]]){0,100})(?(1)\\1))?+\s*\] #If an option exists, match that
(?:\ *(?:\n|<br />))?+ #Get rid of an extra new line before the start of the content if necessary
((?:[^\[\n]++ #Capture all characters until the closing tag
|\n(?!\[spoiler]) #Capture new line separately so backtracking doesn't run away due to above
|\[(?!/?spoiler(?:\s*=[^\]*])?) #Also match all tags that aren't spoilers
|(?R))*+) #Allow the pattern to recurse - we also want to match spoilers inside spoilers,
# without messing up nesting
\n? #Get rid of an extra new line before the closing tag if necessary
\[/spoiler] #match the closing tag
~xi"
There are a couple of other bugs with it as well though.
The simplest solution is to use alternatives instead:
<(?:a|"a")>
([^<]++ | (?R))*
</a>
But if you really don't want to repeat that a part, you can do the following:
<("?)a\1>
([^<]++ | (?R))*
</a>
I've just put the optional quantifier ? inside the group. This time, the capturing group always participates in the match, but what it captures can be empty, so the conditional isn't necessary anymore.
Side note: I've applied a possessive quantifier to [^<] to avoid catastrophic backtracking.
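To see the difference, here is a minimal PHP sketch that runs the question's simplified pattern and the <("?)a\1> variant against the test string from the question. The variable names are mine, and I dropped the x flag and the comments to keep it short; the behaviour of the first pattern is the one described in the question.

```php
<?php
// Test string from the question: a quoted outer tag around an unquoted inner tag.
$subject = '<"a">Content<a>Also Content</a></a>';

// Question's approach: optional capturing group plus a conditional on group 1.
$original = '~<(")?a(?(1)\1)>([^<]++|(?R))*</a>~';

// Suggested variant: group 1 always participates, but may capture an empty string.
$fixed = '~<("?)a\1>([^<]++|(?R))*</a>~';

preg_match($original, $subject, $m1);
preg_match($fixed, $subject, $m2);

// Expected, per the question: only the inner tag for the first pattern...
echo $m1[0], PHP_EOL;
// ...and the whole string for the second one.
echo $m2[0], PHP_EOL;
```

Inside the recursion, ("?) captures an empty string for the unquoted inner tag, so \1 matches nothing there; once the recursion returns, the outer value of group 1 is restored.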
In your case I believe it's better to match a generic tag than a specific one. Match all tags, and then decide in your code what to do with the match.
Here's a full regex:
\[
(?<tag>\w+) \s*
(?:=\s*
(?:
(?<quote>["']) (?<arg>.{0,100}?) \k<quote>
| (?<arg>[^\]]+)
)
)?
\]
(?<content>
(?:[^[]++ | (?R) )*+
)
\[/\k<tag>\]
Note that I added the J option (PCRE_DUPNAMES) to be able to use (?<arg>...) twice.
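A rough sketch of how this could be used from PHP, under a few assumptions of mine: I enable duplicate names with an inline (?J) instead of a modifier, and I read the argument from the numbered groups (3 and 4 here) rather than from $m['arg'], because how duplicate names end up in $matches is something I'd rather not rely on. The sample input is made up.

```php
<?php
// (?J) turns on PCRE_DUPNAMES inline; the x modifier allows the free-spacing layout.
$pattern = <<<'EOD'
~(?J)
\[
(?<tag>\w+) \s*
(?:=\s*
  (?:
    (?<quote>["']) (?<arg>.{0,100}?) \k<quote>
  | (?<arg>[^\]]+)
  )
)?
\]
(?<content>
  (?:[^[]++ | (?R) )*+
)
\[/\k<tag>\]
~x
EOD;

// Hypothetical input: a quoted option outside, an unquoted one inside.
$subject = '[spoiler="opt"]a[b=x]c[/b]d[/spoiler]';

if (preg_match($pattern, $subject, $m)) {
    $tag = $m['tag'];
    // Groups 3 and 4 are the two "arg" groups; at most one of them is non-empty.
    $arg = $m[3] !== '' ? $m[3] : $m[4];
    $content = $m['content'];
    printf("tag=%s arg=%s content=%s\n", $tag, $arg, $content);
}
```

From there you can decide in code what to do with each tag, and run the same pattern again on $content if you need the nested tags too.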
(?(1)...) only checks whether group 1 has been defined, so the condition is true once the group has been defined for the first time. That is why you obtain this result (it is not related to the recursion level).
So when <a> is reached in the recursion, the regex engine tries to match <a"> and fails.
If you want to keep a conditional, you can write <("?)a(?(1)\1)> instead. This way, group 1 is redefined each time.
Obviously you can write your pattern in a more efficient way like this:
~<(?:a|"a")>[^<]*+(?:(?R)[^<]*)*+</a>~
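As a quick check, this small PHP sketch runs both variants (the conditional with an always-participating group, and the unrolled alternation) against the question's test string. The array keys and variable names are only for illustration.

```php
<?php
$subject = '<"a">Content<a>Also Content</a></a>';

$patterns = [
    // Conditional kept, but group 1 now always participates (possibly empty).
    'conditional' => '~<("?)a(?(1)\1)>([^<]++|(?R))*</a>~',
    // Alternation for the tag plus an unrolled loop for the content.
    'unrolled'    => '~<(?:a|"a")>[^<]*+(?:(?R)[^<]*)*+</a>~',
];

$matches = [];
foreach ($patterns as $name => $pattern) {
    preg_match($pattern, $subject, $m);
    // Store the full match, or null when the pattern did not match at all.
    $matches[$name] = $m[0] ?? null;
}

// Both entries should be the complete test string.
var_dump($matches);
```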
For your particular problem, I would use this kind of pattern to match any tag:
$pattern = <<<'EOD'
~
\[ (?<tag>\w+) \s*
(?:
= \s*
(?| " (?<option>[^"]*) " | ' ([^']*) ' | ([^]\s]*) ) # branch reset feature
)?
\s* ]
(?<content> [^[]*+ (?: (?R) [^[]*)*+ )
\[/\g{tag}]
~xi
EOD;
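A possible way to use it, as a sketch only: the sample strings are invented, and the pattern is repeated so the snippet runs on its own. Thanks to the branch reset, the option lands in the same group whichever quoting style was used.

```php
<?php
$pattern = <<<'EOD'
~
\[ (?<tag>\w+) \s*
(?:
    = \s*
    (?| " (?<option>[^"]*) " | ' ([^']*) ' | ([^]\s]*) )  # branch reset feature
)?
\s* ]
(?<content> [^[]*+ (?: (?R) [^[]*)*+ )
\[/\g{tag}]
~xi
EOD;

// Hypothetical inputs covering the three forms from the question.
$subjects = [
    '[spoiler]plain[/spoiler]',
    '[spoiler=opt]unquoted[/spoiler]',
    '[spoiler="some opt"]outer [b]bold[/b] text[/spoiler]',
];

$results = [];
foreach ($subjects as $subject) {
    if (preg_match($pattern, $subject, $m)) {
        // The option is an empty string when the tag has no argument.
        $results[] = [$m['tag'], $m['option'] ?? '', $m['content']];
    }
}

print_r($results);
```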
If you want to impose a specific tag at the top level, you can add (?(R)|(?=spoiler\b)) before the tag name.
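Here is a sketch of the pattern with that guard in place, again with made-up input. Outside of any recursion the lookahead requires the tag to be spoiler; inside a recursion the condition (R) is true and the empty branch is taken, so any tag is accepted.

```php
<?php
$pattern = <<<'EOD'
~
\[ (?(R)|(?=spoiler\b)) (?<tag>\w+) \s*
(?:
    = \s*
    (?| " (?<option>[^"]*) " | ' ([^']*) ' | ([^]\s]*) )
)?
\s* ]
(?<content> [^[]*+ (?: (?R) [^[]*)*+ )
\[/\g{tag}]
~xi
EOD;

// A spoiler wrapped in another tag: the match should start at the spoiler.
$nested = '[b][spoiler]x[i]y[/i][/spoiler][/b]';
preg_match($pattern, $nested, $m);
echo $m[0], PHP_EOL;

// No spoiler at all: preg_match() returns 0 because the top level is rejected.
$count = preg_match($pattern, '[b]bold[/b]');
var_dump($count);
```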