What's the best way to validate that a document follows some version of HTML (preferably one that I can specify)? I'd like to know where the failures occur, as with a web-based validator, except in a native Python app.
PyTidyLib is a nice Python binding for HTML Tidy. Their example:
    from tidylib import tidy_document

    # tidy_document returns the cleaned-up markup and a text report of problems
    document, errors = tidy_document('''<p>fõo <img src="bar.jpg">''',
                                     options={'numeric-entities': 1})
    print(document)
    print(errors)
Moreover, it's compatible with both legacy HTML Tidy and the new tidy-html5.
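To get at where the failures occur, the report string can be split into structured records. The following is only a sketch: it assumes the report uses Tidy's usual `line N column M - Severity: message` layout, and `TidyIssue`, `parse_tidy_report`, and the sample text are hypothetical names and data made up for illustration, not part of PyTidyLib.

```python
import re
from typing import List, NamedTuple


class TidyIssue(NamedTuple):
    """One entry from a Tidy report (hypothetical helper type)."""
    line: int
    column: int
    severity: str
    message: str


# Assumed report layout: "line 1 column 1 - Warning: some message"
_ISSUE_RE = re.compile(r"^line (\d+) column (\d+) - (\w+): (.*)$")


def parse_tidy_report(report: str) -> List[TidyIssue]:
    """Turn a Tidy text report into a list of issues, skipping unrecognised lines."""
    issues = []
    for raw in report.splitlines():
        match = _ISSUE_RE.match(raw.strip())
        if match:
            line, column, severity, message = match.groups()
            issues.append(TidyIssue(int(line), int(column), severity, message))
    return issues


# Illustrative sample text, not captured from a real run
sample = (
    "line 1 column 1 - Warning: missing <!DOCTYPE> declaration\n"
    'line 1 column 9 - Warning: <img> lacks "alt" attribute\n'
)
issues = parse_tidy_report(sample)
for issue in issues:
    print(issue.line, issue.column, issue.severity, issue.message)
```

In practice you would pass the `errors` value returned by `tidy_document` instead of `sample`.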
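XHTML is easy: use lxml.

    from io import StringIO  # on Python 2: from StringIO import StringIO
    from lxml import etree

    # html holds the markup to check; recover=False asks the parser
    # to report errors instead of silently repairing them
    etree.parse(StringIO(html), etree.HTMLParser(recover=False))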
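If lxml isn't available, a rougher check is possible with the standard library alone. This is a different technique from the lxml call above: it only tests XML well-formedness, not conformance to any XHTML DTD, and it will reject named entities such as `&nbsp;` that the parser doesn't know about. `first_xml_error` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def first_xml_error(markup):
    """Return (line, column, message) for the first well-formedness error, or None."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        line, column = exc.position  # line is 1-based, column is 0-based
        return line, column, str(exc)
    return None


print(first_xml_error("<p>ok</p>"))
print(first_xml_error("<p><b>x</p>"))
```

The parser stops at the first error, so this reports one failure per run rather than a full list.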
HTML is harder, since there has traditionally been less interest in validation among the HTML crowd (run Stack Overflow itself through a validator, yikes). The easiest solution would be to execute an external application such as nsgmls or OpenJade and then parse its output.