
Validate (X)HTML in Python

What's the best way to validate that a document follows some version of HTML (preferably one that I can specify)? I'd like to know where the failures occur, as a web-based validator would report them, but in a native Python app.

cdleary asked Aug 30 '08 01:08


2 Answers

PyTidyLib is a nice Python binding for HTML Tidy. Their example:

    from tidylib import tidy_document

    document, errors = tidy_document(
        '''<p>f&otilde;o <img src="bar.jpg">''',
        options={'numeric-entities': 1})
    print(document)  # the tidied markup
    print(errors)    # Tidy's report of warnings and errors

Moreover, it's compatible with both legacy HTML Tidy and the newer tidy-html5.
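To find out where the failures occur, the report text can be split into structured records. This is a minimal sketch, not part of PyTidyLib: parse_tidy_errors is a hypothetical helper, and the "line N column M - Severity: message" shape is an assumption based on HTML Tidy's usual report format, so check it against the output of your Tidy version.

```python
import re

# Assumed line shape, based on HTML Tidy's usual report format:
#   line <N> column <M> - <Severity>: <message>
TIDY_LINE_RE = re.compile(
    r"^line (?P<line>\d+) column (?P<column>\d+) - "
    r"(?P<severity>\w+): (?P<message>.*)$"
)


def parse_tidy_errors(errors):
    """Turn Tidy's report text into a list of dicts with positions."""
    records = []
    for raw in errors.splitlines():
        match = TIDY_LINE_RE.match(raw.strip())
        if match:
            records.append({
                "line": int(match["line"]),
                "column": int(match["column"]),
                "severity": match["severity"],
                "message": match["message"],
            })
    return records


# Sample report text used only for illustration.
sample = (
    "line 1 column 1 - Warning: missing <!DOCTYPE> declaration\n"
    "line 1 column 1 - Warning: inserting missing 'title' element\n"
)
for record in parse_tidy_errors(sample):
    print(record["line"], record["column"], record["severity"], record["message"])
```

Lines that do not match the assumed shape are skipped rather than raising, so summary lines in the report are ignored.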

Dave Brondsema answered Sep 22 '22 23:09


XHTML is easy: use lxml.

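    from io import StringIO

    from lxml import etree

    # html holds the markup to check, as a str.
    # recover=False turns off lxml's recovery from broken markup.
    etree.parse(StringIO(html), etree.HTMLParser(recover=False))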
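If lxml is not available, a rough stdlib-only check can at least report where an XHTML document stops being well-formed XML. This is a sketch using xml.etree.ElementTree rather than lxml, and it checks well-formedness only, not validity against a DTD. check_well_formed is a hypothetical helper name. Named entities such as &nbsp; are likely to be reported as errors, since no DTD is loaded.

```python
import xml.etree.ElementTree as ET


def check_well_formed(markup):
    """Return None if markup parses as XML, else (line, column, message)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        line, column = exc.position  # line is 1-based, column is 0-based
        return line, column, str(exc)
    return None


good = "<html><body><p>ok</p></body></html>"
bad = "<html><body><p>unclosed</body></html>"

print(check_well_formed(good))
print(check_well_formed(bad))
```

The first call prints None. The second prints a tuple whose first two items locate the mismatched tag.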

HTML is harder, since there's traditionally not been as much interest in validation among the HTML crowd (run StackOverflow itself through a validator, yikes). The easiest solution would be to execute external applications such as nsgmls or OpenJade, and then parse their output.
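Running an external validator and parsing its output could look roughly like the sketch below. Everything here is an assumption for illustration: run_validator and parse_messages are hypothetical helpers, and the "tool:file:line:column:type: text" message shape must be checked against the real output of whichever program you run. A child Python process stands in for the validator so that the sketch runs without nsgmls or OpenJade installed.

```python
import re
import subprocess
import sys

# Assumed message shape: "<tool>:<file>:<line>:<column>:<type>: <text>".
# Check this against the real output of whichever validator is used.
MESSAGE_RE = re.compile(
    r"^(?P<tool>[^:]+):(?P<file>[^:]+):(?P<line>\d+):(?P<column>\d+):"
    r"(?:(?P<kind>[A-Z]):)?\s*(?P<text>.*)$"
)


def parse_messages(output):
    """Extract (line, column, text) tuples from validator output."""
    found = []
    for raw in output.splitlines():
        match = MESSAGE_RE.match(raw)
        if match:
            found.append(
                (int(match["line"]), int(match["column"]), match["text"])
            )
    return found


def run_validator(command):
    """Run an external command and parse whatever it writes to stderr."""
    result = subprocess.run(command, capture_output=True, text=True)
    return parse_messages(result.stderr)


# Stand-in for a real validator: a child Python process that writes one
# made-up diagnostic to stderr.
fake = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('tool:page.html:3:7:E: example message\\n')",
]
print(run_validator(fake))
```

To use a real validator, pass its command line as the list given to run_validator and adjust MESSAGE_RE to match what it actually prints.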

John Millikin answered Sep 20 '22 23:09