Disclosure: I have read this answer many times here on SO and I know better than to use regex to parse HTML. This question is just to broaden my knowledge with regex.
Say I have this string:
some text <tag link="fo>o"> other text
I want to match the whole tag, but if I use <[^>]+> it only matches <tag link="fo>.
How can I make sure that a > inside quotes is ignored?
I can trivially write a parser with a while loop to do this, but I want to know how to do it with regex.
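For reference, the behaviour described above can be reproduced with a few lines. This is only a sketch: the question does not name a language, so Python's re module is an assumption here, and the variable names are made up for illustration.

```python
import re

# The string and the naive pattern from the question.
text = 'some text <tag link="fo>o"> other text'
naive = re.compile(r'<[^>]+>')

# [^>]+ cannot cross a '>', so the match ends at the first '>' it meets,
# even though that '>' sits inside the quoted attribute value.
naive_match = naive.search(text).group(0)
print(naive_match)  # <tag link="fo>
```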
<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>
http://regex101.com/r/yX5xS8
I know this regex might be a headache to look at, so here is my explanation:
< # Open HTML tags
[^>]*? # Lazy Negated character class for closing HTML tag
(?: # Open Outside Non-Capture group
(?: # Open Inside Non-Capture group
('|") # Capture group for quotes, backreference group 1
[^'"]*? # Lazy Negated character class for quotes
\1 # Backreference 1
) # Close Inside Non-Capture group
[^>]*? # Lazy Negated character class for closing HTML tag
)* # Close Outside Non-Capture group
> # Close HTML tags
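As a quick check, the pattern can be run against the string from the question. This is a minimal sketch assuming Python's re module (the question is not tied to a language); the names pattern, text and tag are illustrative only.

```python
import re

# Pattern from this answer, unchanged. A raw triple-quoted string is used
# so that both quote characters can appear without escaping.
pattern = re.compile(r"""<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>""")

text = 'some text <tag link="fo>o"> other text'

# The quoted attribute value is consumed as a unit, so the '>' inside it
# no longer ends the match.
tag = pattern.search(text).group(0)
print(tag)  # <tag link="fo>o">
```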
This is a slight improvement on Vasili Syrakis's answer. It handles "…" and '…' completely separately, and does not use the lazy *? quantifier.
<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>
http://regex101.com/r/jO1oQ1
< # start of HTML tag
[^'">]* # any run of characters other than ', " or >
( # outer group
( # inner group
"[^"]*" # "..."
| # or
'[^']*' # '...'
) # end of inner group
[^'">]* # any run of characters other than ', " or >
)* # zero or more repetitions of the outer group
> # end of HTML tag
This version is slightly better than Vasili's in that single quotes are allowed inside "…", double quotes are allowed inside '…', and an (incorrect) tag like <a href='> will not be matched.
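The differences can be seen by running both patterns side by side. The sketch below assumes Python's re module; the helper first_match and the two test strings are made up for this illustration and are not part of either answer.

```python
import re

# Pattern from the earlier answer (one backreference shared by both quote types).
backref = re.compile(r"""<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>""")
# Pattern from this answer (each quote type has its own alternative).
separate = re.compile(r"""<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>""")

def first_match(regex, text):
    """Return the first matched substring, or None when nothing matches."""
    m = regex.search(text)
    return m.group(0) if m else None

# A double-quoted value containing both a single quote and a '>'.
mixed = '''<a title="it's > here">'''
# A tag whose single-quoted value is never closed.
unclosed = "<a href='>"

results = {
    "backref/mixed": first_match(backref, mixed),
    "separate/mixed": first_match(separate, mixed),
    "backref/unclosed": first_match(backref, unclosed),
    "separate/unclosed": first_match(separate, unclosed),
}
for name, value in results.items():
    print(name, "->", value)
```

On these inputs the earlier pattern stops at the > inside the quoted value of the first string and accepts the unclosed tag, while this pattern matches the first tag in full and returns no match for the second.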
It is slightly worse than Vasili's solution in that the groups are capturing. If you do not want that, replace ( with (?: in all places. (Using a plain ( just makes the regex shorter and a little more readable.)
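Whether the groups capture can matter in practice. As one example, assuming Python's re module, findall returns the group contents instead of the whole match as soon as the pattern contains capturing groups. The variable names below are illustrative only.

```python
import re

# This answer's pattern as written, with capturing groups.
capturing = re.compile(r"""<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>""")
# The same pattern with every ( replaced by (?: as suggested above.
non_capturing = re.compile(r"""<[^'">]*(?:(?:"[^"]*"|'[^']*')[^'">]*)*>""")

text = 'some text <tag link="fo>o"> other text'

# With capturing groups, findall yields one tuple of group contents per match.
with_groups = capturing.findall(text)
# Without them, findall yields the matched tags themselves.
whole_tags = non_capturing.findall(text)

print(with_groups)  # [('"fo>o"', '"fo>o"')]
print(whole_tags)   # ['<tag link="fo>o">']
```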