
Regular expression: finding two elements not surrounding another element in text

Tags:

regex

I need to find badly formatted HTML content in some text. We let users add strong and em tags, but they don't always close them correctly:

This is some <b>correct</b> formatting
This is some <b>incorrect<b> formatting

I would like to catch instances where the formatting is incorrect, i.e. where an opening tag is followed by another opening tag instead of a closing tag. I started using negative lookaheads but have not had much success so far:

<b>(?!.*?<\/b>.*?)<b>
  • <b> Get opening tag
  • (?! negative lookahead for
    • .*? anything, but not greedily
    • <\/b> the closing tag
    • .*? anything, but not greedily
  • ) closing the lookahead
  • <b> Another opening tag
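
To see why the attempt above finds nothing, it can help to run it. This is only a rough sketch in Python's `re` module (the question does not name a regex flavor, so that choice is an assumption), and the third sample string is a made-up input. The lookahead is checked once, right after the first `<b>`, and the second `<b>` then has to follow immediately, so the pattern can only ever match two opening tags back to back.

```python
import re

# The pattern from the question, unchanged.
attempt = re.compile(r"<b>(?!.*?<\/b>.*?)<b>")

samples = [
    "This is some <b>correct</b> formatting",
    "This is some <b>incorrect<b> formatting",
    "<b><b>",  # hypothetical input: two opening tags with nothing in between
]

def attempt_matches(text):
    """Return True if the question's pattern finds a match anywhere in text."""
    return attempt.search(text) is not None

for sample in samples:
    print(attempt_matches(sample), repr(sample))
```

Only the last sample matches: nothing in the pattern is allowed to consume the text between the two opening tags.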

Any idea how I could do that?

Addendum: I know about Tony the pony, but I feel it is not coming right now. This problem could be restated as "I want to find two occurrences of the word "zoinx" with no occurrence of the word "palantir" in between", which is not HTML-related.

samy asked Oct 19 '22 10:10



1 Answer

<b>(?:(?!<\/b>).)*<b>

Try this. See the demo:

https://regex101.com/r/nS2lT4/19
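
For anyone who wants to try this outside regex101, here is a minimal sketch in Python (an assumption, since no language is named in the thread; `find_unclosed` is just an illustrative helper name). The group `(?:(?!<\/b>).)*` consumes one character at a time, and only while a closing `</b>` does not start at that position, so the match can never run past a closing tag.

```python
import re

# The answer's pattern: an opening <b>, then a run of characters in which no
# position starts a closing </b>, then another opening <b>.
unclosed_b = re.compile(r"<b>(?:(?!<\/b>).)*<b>")

def find_unclosed(text):
    """Return every span where <b> is opened again before being closed."""
    return [m.group(0) for m in unclosed_b.finditer(text)]

print(find_unclosed("This is some <b>correct</b> formatting"))
print(find_unclosed("This is some <b>incorrect<b> formatting"))
print(find_unclosed("<b>one</b> and <b>two<b>"))
```

Note that `.` does not match a newline by default in Python, so the two tags have to be on the same line unless `re.DOTALL` is passed.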

For a generalized version that works with any tag name, use:

<([^>]*)>(?:(?!<\/\1>).)*<\1>

See demo.

https://regex101.com/r/nS2lT4/24
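
A quick sketch of the generalized pattern in Python (again an assumed language; `reopened_tags` and the `strict` variant are illustrative and not part of the answer). One thing that may be worth checking on your own data: because `[^>]*` accepts any characters, a closing tag such as `</b>` can itself be captured as a "tag name", so two closing tags on the same line get reported. Restricting the name to `\w+` is one possible way around that.

```python
import re

# The answer's generalized pattern, unchanged: group 1 captures whatever sits
# between < and >, and \1 reuses it for the closing tag and the repeated tag.
generalized = re.compile(r"<([^>]*)>(?:(?!<\/\1>).)*<\1>")

# Assumed variant (not from the answer): only word characters are allowed in
# the tag name, so a closing tag like </b> cannot be captured as a name.
strict = re.compile(r"<(\w+)>(?:(?!<\/\1>).)*<\1>")

def reopened_tags(pattern, text):
    """Return the captured tag names that are opened again before being closed."""
    return [m.group(1) for m in pattern.finditer(text)]

print(reopened_tags(generalized, "some <em>incorrect<em> text"))
print(reopened_tags(generalized, "some <b>fine</b> and <i>fine</i>"))
print(reopened_tags(generalized, "<b>x</b> y <b>z</b>"))  # reports "/b"
print(reopened_tags(strict, "<b>x</b> y <b>z</b>"))
```

Neither pattern handles tags with attributes or nesting; for that, an HTML parser is probably the safer tool.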

vks answered Oct 21 '22 23:10
