
How to parse broken HTML with LXML

I'm trying to parse broken HTML with the lxml parser on Python 2.5 and 2.7.

Contrary to the lxml documentation (http://lxml.de/parsing.html#parsing-html), parsing broken HTML does not work:

    from lxml import etree
    import StringIO

    broken_html = "<html><head><title>test<body><h1>page title</h3>"
    parser = etree.HTMLParser()
    tree   = etree.parse(StringIO.StringIO(broken_html))

Result:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "lxml.etree.pyx", line 2954, in lxml.etree.parse (src/lxml/lxml.etree.c:56220)
      File "parser.pxi", line 1550, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82482)
      File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82764)
      File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81562)
      File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78232)
      File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74488)
      File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75379)
      File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74712)
    lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: h1 line 1 and h3, line 1, column 50
diemacht asked Oct 01 '13 14:10



2 Answers

Don't just construct the parser; pass it to etree.parse() (as in the example you link to):

    >>> tree = etree.parse(StringIO.StringIO(broken_html), parser=parser)
    >>> tree
    <lxml.etree._ElementTree object at 0x2fd8e60>

Or use lxml.html as a shortcut:

    >>> from lxml import html
    >>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
    >>> html.fromstring(broken_html)
    <Element html at 0x2dde650>
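For anyone on Python 3, here is a rough sketch of both approaches side by side. It assumes io.StringIO in place of the Python 2 StringIO module; the names root and shortcut_root are only for illustration. Printing the serialized tree shows how the parser repaired the markup, though the exact repaired structure may vary with the libxml2 version that lxml was built against.

```python
import io

from lxml import etree, html

broken_html = "<html><head><title>test<body><h1>page title</h3>"

# Option 1: build an HTML parser and actually hand it to etree.parse().
parser = etree.HTMLParser()
tree = etree.parse(io.StringIO(broken_html), parser=parser)
root = tree.getroot()

# Option 2: lxml.html.fromstring() selects an HTML parser on its own.
shortcut_root = html.fromstring(broken_html)

# Serialize the repaired tree to inspect what the parser made of the input.
print(etree.tostring(root, pretty_print=True, method="html").decode())
print(shortcut_root.tag)
```

Without the parser= argument, etree.parse() falls back to the strict XML parser, which is what raised the XMLSyntaxError in the question.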
Fred Foo answered Sep 18 '22 05:09


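lxml lets you load broken markup by creating a parser instance with recover=True:

    etree.HTMLParser(recover=True)

etree.HTMLParser already recovers by default, so the flag mainly matters for etree.XMLParser, which is strict unless you pass recover=True. In both cases the parser only takes effect once you pass it to the parsing call via parser=.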
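As a small sketch of what recover=True changes for XML input (assuming Python 3; the broken_xml string and the variable names are made up for illustration), compare the default strict parser with a recovering one:

```python
from lxml import etree

# Hypothetical sample: the second <item> is never closed.
broken_xml = "<root><item>one</item><item>two</root>"

# The default XML parser is strict and rejects the mismatched tags.
try:
    etree.fromstring(broken_xml)
    strict_failed = False
except etree.XMLSyntaxError as exc:
    strict_failed = True
    print("strict parser failed:", exc)

# A recovering parser keeps whatever it can salvage instead of raising.
recovering_parser = etree.XMLParser(recover=True)
root = etree.fromstring(broken_xml, parser=recovering_parser)
print(etree.tostring(root).decode())
```

Recovery is best effort, so it is worth checking the resulting tree before relying on it.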

Jerome Anthony answered Sep 18 '22 05:09