 

How do I fix wrongly nested / unclosed HTML tags?


I need to sanitize HTML submitted by the user by closing any open tags with correct nesting order. I have been looking for an algorithm or Python code to do this but haven't found anything except some half-baked implementations in PHP, etc.

For example, something like

<p>
  <ul>
    <li>Foo

becomes

<p>
  <ul>
    <li>Foo</li>
  </ul>
</p>

Any help would be appreciated :)

Baishampayan Ghose asked Nov 16 '08 04:11





2 Answers

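Using BeautifulSoup (shown here with the current bs4 package and Python 3):

# Requires the third-party package: pip install beautifulsoup4
from bs4 import BeautifulSoup

html = "<p><ul><li>Foo"
# html.parser keeps the <ul> nested inside the <p>; other parsers may repair it differently
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())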

gets you

<p>
 <ul>
  <li>
   Foo
  </li>
 </ul>
</p>

As far as I know, you can't control putting the <li></li> tags on separate lines from Foo.

Using Tidy:

# The tidy module comes from the third-party uTidylib package
import tidy

html = "<p><ul><li>Foo"
print(tidy.parseString(html, show_body_only=True))

gets you

<ul>
<li>Foo</li>
</ul>

Unfortunately, I know of no way to keep the <p> tag in the example. Tidy interprets it as an empty paragraph rather than an unclosed one, so doing

print(tidy.parseString(html, show_body_only=True, drop_empty_paras=False))

comes out as

<p></p>
<ul>
<li>Foo</li>
</ul>

Ultimately, of course, the <p> tag in your example is redundant, so you might be fine with losing it.

Finally, Tidy can also do indenting:

print(tidy.parseString(html, show_body_only=True, indent=True))

becomes

<ul>
  <li>Foo
  </li>
</ul>

All of these have their ups and downs, but hopefully one of them is close enough.

pantsgolem answered Oct 09 '22 06:10



Run it through Tidy or one of its ported libraries.

Try to code it by hand and you will want to gouge your eyes out.
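
For a sense of what the hand-coded route involves, here is a minimal sketch that uses only Python's standard-library html.parser. The names TagCloser, close_tags and VOID_TAGS are made up for this illustration and are not part of any library. It keeps a stack of open tags and closes whatever is left over, innermost first. It assumes every non-void element needs an explicit closing tag, silently drops stray closing tags, comments and doctypes, and does no sanitizing at all (scripts and attributes pass through untouched), so treat it as an illustration rather than a replacement for Tidy or BeautifulSoup.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagCloser(HTMLParser):
    """Hypothetical helper: re-emit HTML and track which tags are still open."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []    # output fragments, joined at the end
        self.stack = []  # names of currently open tags, outermost first

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # keep the tag as written
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <br/> open nothing.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close everything opened after the matching tag, then the tag itself.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def close_tags(html):
    """Return html with unclosed tags closed in correct nesting order."""
    parser = TagCloser()
    parser.feed(html)
    parser.close()
    # Whatever is still on the stack was never closed: close innermost first.
    while parser.stack:
        parser.out.append("</%s>" % parser.stack.pop())
    return "".join(parser.out)


print(close_tags("<p><ul><li>Foo"))
```

This prints <p><ul><li>Foo</li></ul></p> for the example in the question, and turns a mis-nested fragment like <b><i>x</b> into <b><i>x</i></b>. Everything a real HTML parser knows about implied end tags (a new <li> closing the previous one, a <ul> ending an open <p>) is missing here, which is exactly where the hand-coded approach gets painful.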

Nicholas Piasecki answered Oct 09 '22 08:10
