I need to find badly formatted HTML content in some text; we let users add strong and em tags, but they don't always close them correctly:
This is some <b>correct</b> formatting
This is some <b>incorrect<b> formatting
I would like to catch instances where the formatting is incorrect, i.e. where an opening tag is not followed by a closing tag. I started using negative lookaheads but have not had much success so far:
<b>(?!.*?<\/b>.*?)<b>
<b>      Get opening tag
(?!      negative lookahead for
.*?      anything, but not greedily
<\/b>    the closing tag
.*?      anything, but not greedily
)        closing the lookahead
<b>      Another opening tag
Any idea how I could do that?
Addendum: I know about Tony the pony, but I feel it is not coming right now. This problem could be replaced by "I want to find two occurrences of the word 'zoinx' with no occurrence of the word 'palantir' in between", which is not HTML-related.
<b>(?:(?!<\/b>).)*<b>
Try this. See demo.
https://regex101.com/r/nS2lT4/19
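For anyone who wants to try this outside regex101, here is a minimal sketch in Python (my choice of language; the question does not name one). It runs the pattern above against the two sample lines from the question and also shows why the original attempt finds nothing: a lookahead is zero-width, so `<b>(?!...)<b>` can only match two `<b>` tags that sit directly next to each other. The names `UNCLOSED_B`, `ORIGINAL_ATTEMPT` and `has_unclosed_b` are made up for this illustration.

```python
import re

# Pattern from the answer: an opening <b>, then any run of characters in which
# no position starts a closing </b>, then another opening <b>.
UNCLOSED_B = re.compile(r"<b>(?:(?!</b>).)*<b>")

# Pattern from the question: the lookahead consumes nothing, so the second <b>
# has to start immediately after the first one.
ORIGINAL_ATTEMPT = re.compile(r"<b>(?!.*?</b>.*?)<b>")

samples = [
    "This is some <b>correct</b> formatting",
    "This is some <b>incorrect<b> formatting",
]


def has_unclosed_b(text):
    """Return True if a <b> is followed by another <b> before any </b>."""
    return UNCLOSED_B.search(text) is not None


for sample in samples:
    print(has_unclosed_b(sample), "|", sample)
```

Note that `.` does not match a newline by default, so if the user text can span several lines you would probably want to compile the pattern with `re.DOTALL`.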
For a generalized version use
<([^>]*)>(?:(?!<\/\1>).)*<\1>
See demo.
https://regex101.com/r/nS2lT4/24
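A rough Python sketch of the generalized version follows, again only as an illustration. One caveat I noticed while testing: because `[^>]*` also accepts a leading slash, the captured "name" can be `/b`, so two closing tags with no opening tag of the same name reported in between will match as well. The tighter variant `STRICT` below, which only captures letters, is my own assumption and not part of the answer above; both names are hypothetical.

```python
import re

# Generalized pattern from the answer: capture whatever is inside <...>, then
# require the same tag again before the matching closing tag appears.
GENERAL = re.compile(r"<([^>]*)>(?:(?!</\1>).)*<\1>")

# Assumed stricter variant: the captured name must consist of letters only,
# so a closing tag such as </b> can never be taken for an opening tag.
STRICT = re.compile(r"<([a-zA-Z]+)>(?:(?!</\1>).)*<\1>")


def first_unclosed_tag(text, pattern=STRICT):
    """Return the name of the first tag that is reopened before being closed, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


print(first_unclosed_tag("This is some <em>incorrect<em> formatting"))
print(first_unclosed_tag("This is some <b>correct</b> formatting"))
```

With the stricter capture, text such as `<b>a</b> <i>c</i> <b>d</b>` is left alone, whereas the generalized pattern as posted reports a match starting at the first `</b>`.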