How to parse broken HTML with LXML

I'm trying to parse broken HTML with the lxml parser on Python 2.5 and 2.7.

Unlike what the lxml documentation shows (http://lxml.de/parsing.html#parsing-html), parsing broken HTML does not work for me:

    from lxml import etree
    import StringIO
    broken_html = "<html><head><title>test<body><h1>page title</h3>"
    parser = etree.HTMLParser()
    tree   = etree.parse(StringIO.StringIO(broken_html))

Result:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "lxml.etree.pyx", line 2954, in lxml.etree.parse (src/lxml/lxml.etree.c:56220)
      File "parser.pxi", line 1550, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82482)
      File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82764)
      File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81562)
      File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78232)
      File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74488)
      File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75379)
      File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74712)
    lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: h1 line 1 and h3, line 1, column 50

Can lxml parse HTML?

lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).

What is the difference between HTML parser and lxml?

lxml is a similar parser, but it is driven by XML features rather than HTML. It depends on external C libraries and is faster than html5lib. Let's observe the difference in behavior between these two parsers by taking a sample tag and looking at the output.

What is lxml parser?

lxml is a Python library which allows for easy handling of XML and HTML files, and can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML parsers.

Don't just construct that parser, use it (as per the example you link to):

    >>> tree = etree.parse(StringIO.StringIO(broken_html), parser=parser)
    >>> tree
    <lxml.etree._ElementTree object at 0x2fd8e60>
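For readers on Python 3, here is a minimal sketch of the same fix. It assumes Python 3, where `StringIO` lives in the `io` module, and it prints the recovered tree rather than promising an exact output, since the recovered structure can vary with the underlying libxml2 version.

```python
from io import StringIO  # assumption: Python 3 (the question uses the Python 2 StringIO module)

from lxml import etree

broken_html = "<html><head><title>test<body><h1>page title</h3>"

# Build the lenient HTML parser and actually hand it to etree.parse();
# without the second argument, lxml falls back to its strict XML parser.
parser = etree.HTMLParser()
tree = etree.parse(StringIO(broken_html), parser)

root = tree.getroot()
print(root.tag)
print(etree.tostring(root, pretty_print=True).decode())
```

The only change that matters compared with the question's code is the `parser` argument; everything else is the same call.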

Or use lxml.html as a shortcut:

    >>> from lxml import html
    >>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
    >>> html.fromstring(broken_html)
    <Element html at 0x2dde650>
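Once `lxml.html.fromstring` has returned an element, it can be queried like any other tree. The sketch below uses a shorter, hypothetical input (not the string from the question) that still has the mismatched `<h1>`/`</h3>` pair, and looks up the heading that the parser recovered.

```python
from lxml import html

# Hypothetical input for illustration: mismatched closing tag, unclosed <body> and <html>.
broken_html = "<html><body><h1>page title</h3>"

root = html.fromstring(broken_html)

# The recovered tree supports the usual ElementTree-style lookups.
heading = root.find(".//h1")
print(heading.text)
print(html.tostring(root).decode())
```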

lxml allows you to load broken XML by creating a parser instance with recover=True:

    etree.HTMLParser(recover=True)

You could use the same technique when creating the parser.
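As a rough illustration of the `recover=True` idea applied to XML, the sketch below parses a hypothetical malformed snippet twice: once with the default strict parser, which is expected to raise `XMLSyntaxError`, and once with `etree.XMLParser(recover=True)`. Note that `etree.HTMLParser` already recovers by default, so the flag mostly matters for the XML parser. Recovery is best-effort, so the printed tree should be checked rather than trusted blindly.

```python
from lxml import etree

# Hypothetical malformed XML: <h1> is closed by a mismatched </h3>.
broken_xml = "<root><h1>page title</h3></root>"

# The default XML parser is strict and rejects the document.
try:
    etree.fromstring(broken_xml)
    strict_ok = True
except etree.XMLSyntaxError as exc:
    strict_ok = False
    print("strict parser failed:", exc)

# A recovering parser keeps whatever it can make sense of.
lenient = etree.XMLParser(recover=True)
root = etree.fromstring(broken_xml, parser=lenient)
print(etree.tostring(root).decode())
```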

diemacht

2 Answers

Fred Foo

Jerome Anthony
