I'm trying to parse broken HTML with the lxml parser on Python 2.5 and 2.7.
Contrary to what the lxml documentation shows (http://lxml.de/parsing.html#parsing-html), parsing broken HTML does not work for me:
    from lxml import etree
    import StringIO

    broken_html = "<html><head><title>test<body><h1>page title</h3>"
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO.StringIO(broken_html))
Result:
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "lxml.etree.pyx", line 2954, in lxml.etree.parse (src/lxml/lxml.etree.c:56220)
      File "parser.pxi", line 1550, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82482)
      File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82764)
      File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81562)
      File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78232)
      File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74488)
      File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75379)
      File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74712)
    lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: h1 line 1 and h3, line 1, column 50
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).
lxml is driven more by XML features than by HTML. It depends on external C libraries (libxml2 and libxslt) and is faster than html5lib.
lxml is a Python library which allows for easy handling of XML and HTML files, and can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML parsers.
Don't just construct the parser; pass it to etree.parse(), as in the example you link to:
    >>> tree = etree.parse(StringIO.StringIO(broken_html), parser=parser)
    >>> tree
    <lxml.etree._ElementTree object at 0x2fd8e60>
Or use lxml.html as a shortcut:
    >>> from lxml import html
    >>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
    >>> html.fromstring(broken_html)
    <Element html at 0x2dde650>
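To see what the parser actually recovered, you can serialize the tree back out. The following is only a rough sketch of the two approaches above, not a definitive recipe. It assumes `io.StringIO` with a unicode literal (instead of the Python 2-only `StringIO` module) so the same script also runs on Python 3, and the exact repaired markup may differ between libxml2 versions.

```python
import io

from lxml import etree, html

# Same broken document as in the question (unicode literal for io.StringIO).
broken_html = u"<html><head><title>test<body><h1>page title</h3>"

# Approach 1: build an HTMLParser and actually hand it to etree.parse().
parser = etree.HTMLParser()
tree = etree.parse(io.StringIO(broken_html), parser=parser)
root = tree.getroot()

# Serialize the repaired tree; tostring() returns ASCII-encoded bytes by default.
recovered = etree.tostring(root, pretty_print=True, method="html").decode("ascii")
print(recovered)

# Approach 2: lxml.html.fromstring() uses an HTML parser internally.
shortcut_root = html.fromstring(broken_html)
print(shortcut_root.tag)
```

Omitting `parser=parser` in the first approach makes `etree.parse()` fall back to the default XML parser, which is what raised the `XMLSyntaxError` in the question.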
lxml lets you load broken XML by creating a parser instance with recover=True:
    etree.HTMLParser(recover=True)
You could use the same technique when creating the parser.
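As a small illustration of the `recover=True` idea applied to XML, here is a sketch that assumes a made-up broken document (`broken_xml` is hypothetical, not from the question). As far as I know, `etree.XMLParser` defaults to strict parsing while `etree.HTMLParser` already recovers by default, so the flag matters mainly on the XML side.

```python
from lxml import etree

# Hypothetical broken XML: the second <item> is never closed.
broken_xml = b"<root><item>one</item><item>two</root>"

# The default (strict) XML parser rejects the document.
try:
    etree.fromstring(broken_xml)
    strict_failed = False
except etree.XMLSyntaxError as exc:
    strict_failed = True
    print("strict parser failed: %s" % exc)

# A recovering parser keeps whatever it can make sense of.
recovering_parser = etree.XMLParser(recover=True)
root = etree.fromstring(broken_xml, parser=recovering_parser)
print(etree.tostring(root))
```

Recovery is best effort: the parser may silently drop or restructure content, so it is worth inspecting the resulting tree before relying on it.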