
Regex for unclosed HTML tags

Tags: html, regex

Does someone have a regex to match unclosed HTML tags? For example, the regex would match the <b> and second <i>, but not the first <i> or the first's closing </i> tag:

<i><b>test<i>ing</i>

Is this too complex for regex? Might it require some recursive, programmatic processing?

core asked Oct 19 '25 09:10


2 Answers

I'm sure some regex guru can cobble something together that approximates a solution, but it's a bad idea: HTML isn't a regular language. Consider either an HTML parser that's capable of identifying such problems, or parsing it yourself.

Pesto answered Oct 22 '25 00:10


Yes, it requires recursive processing, potentially quite deep (or an equivalent loop with an explicit stack), and it is not going to be done with a single regex. You could write a regex that handles a few levels of nesting, but not one that works on just any HTML file. This is because the matcher has to remember which tags are open at any given point in the stream, and regexes aren't good at that.

Use a SAX parser with some counters, or keep your state on a stack: push each opening tag and pop it when its closing tag arrives. Think about how you would code this game to see what I mean about HTML tag depth: http://en.wikipedia.org/wiki/Tower_of_Hanoi
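As a rough illustration of the stack idea, here is a minimal sketch using Python's standard html.parser module as the event-driven parser. The names UnclosedTagFinder, find_unclosed and VOID_TAGS are made up for this example, and the pairing rule (a closing tag closes the nearest open tag of the same name) is an assumption, not the only reasonable choice.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are never "unclosed".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class UnclosedTagFinder(HTMLParser):
    """Hypothetical helper: tracks open tags on a stack."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag, (line, column)) still open
        self.unclosed = []   # tags skipped over by a later closing tag

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, self.getpos()))

    def handle_endtag(self, tag):
        # Assumed pairing rule: close the nearest open tag with this name.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                # Anything opened after it was never closed.
                self.unclosed.extend(self.open_tags[i + 1:])
                del self.open_tags[i:]
                return
        # A closing tag with no matching opener is ignored here.


def find_unclosed(html):
    """Return (tag, (line, column)) for each unclosed tag, in document order."""
    finder = UnclosedTagFinder()
    finder.feed(html)
    finder.close()
    # Whatever is still on the stack at the end was never closed either.
    return sorted(finder.unclosed + finder.open_tags, key=lambda item: item[1])


if __name__ == "__main__":
    for tag, (line, col) in find_unclosed("<i><b>test<i>ing</i>"):
        print(f"<{tag}> opened at line {line}, column {col} is never closed")
```

Note that with this nearest-match rule, the </i> in the question's example pairs with the second <i>, so the sketch reports the first <i> and the <b> rather than the <b> and the second <i>. Which <i> counts as the unclosed one depends on the pairing rule you pick; the stack is what makes either rule implementable.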

Karl answered Oct 21 '25 23:10


