I have written a small application that needs to have access to the DOM representation of the underlying HTML page. Lxml is really great but I have not been able to find such an interface. Does someone know if one exists or if there is another tool that does it?
lxml is a Python library that makes it easy to handle XML and HTML files, and it can also be used for web scraping. There are many off-the-shelf XML parsers, but developers who need better results sometimes end up writing their own XML and HTML parsing code. This is where the lxml library comes into play.
It is not uncommon for lxml/libxml2 to parse and fix broken HTML better, but BeautifulSoup has superior support for encoding detection. Which parser works better depends very much on the input. The lxml documentation also notes that the downside of using the BeautifulSoup parser is that it is much slower than lxml's own HTML parser.
lxml has been downloaded from the Python Package Index millions of times and is also available directly in many package distributions, e.g. for Linux or macOS.
lxml can make use of BeautifulSoup as a parser backend, just like BeautifulSoup can employ lxml as a parser. When using BeautifulSoup from lxml, however, the default is to use Python's integrated HTML parser in the html.parser module.
According to the lxml documentation, you can parse the document with lxml and then use its SAX support (lxml.sax) to feed Python's xml.dom.pulldom module, which builds a DOM object. Based on the documentation, the code might look like:
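import lxml.sax
from xml.dom.pulldom import SAX2DOM
# "tree" is an lxml ElementTree (or Element) that has already been parsed
handler = SAX2DOM()
lxml.sax.saxify(tree, handler)
# the resulting xml.dom.minidom Document
dom = handler.document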
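As a minimal, self-contained sketch of the approach above (assuming Python 3 with lxml installed; the helper name html_to_dom and the sample markup are hypothetical and chosen only for illustration), the DOM could be built and queried like this:

```python
from io import StringIO
from xml.dom.pulldom import SAX2DOM

import lxml.sax
from lxml import etree


def html_to_dom(html_text):
    """Parse HTML with lxml, then replay it as SAX events into a DOM builder.

    html_to_dom is a hypothetical helper name, not part of lxml.
    """
    tree = etree.parse(StringIO(html_text), etree.HTMLParser())
    handler = SAX2DOM()
    lxml.sax.saxify(tree, handler)
    return handler.document


# Sample markup made up for this sketch.
sample = "<html><head><title>test</title></head><body><h1>page title</h1></body></html>"
dom = html_to_dom(sample)

# Standard DOM calls now work on the result.
headings = dom.getElementsByTagName("h1")
heading_text = headings[0].firstChild.data
print(dom.documentElement.tagName, heading_text)
```

The returned object is a regular xml.dom.minidom document, so code written against the standard DOM interface should be able to use it, at the cost of building a second copy of the tree.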
There is an example of parsing HTML on the lxml site:
>>> from lxml import etree
>>> from io import StringIO
>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
>>> parser = etree.HTMLParser()
>>> tree = etree.parse(StringIO(broken_html), parser)
>>> result = etree.tostring(tree.getroot(), encoding="unicode",
... pretty_print=True, method="html")
>>> print(result)
<html>
<head>
<title>test</title>
</head>
<body>
<h1>page title</h1>
</body>
</html>
You can access tree elements using methods such as tree.find, tree.findall, tree.iter, tree.xpath
and others. For example:
>>> list(tree.getroot())
[<Element head at 0x4f4ad38>, <Element body at 0x4f4ad80>]
>>> tree.getroot().find('body')
<Element body at 0x4f4ad80>
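To illustrate those access methods side by side, here is a small sketch (Python 3 assumed; the sample markup and variable names are made up for illustration):

```python
from io import StringIO

from lxml import etree

# Sample markup made up for this sketch.
html_text = (
    "<html><head><title>test</title></head>"
    "<body><h1>page title</h1><p>one</p><p>two</p></body></html>"
)
tree = etree.parse(StringIO(html_text), etree.HTMLParser())
root = tree.getroot()

# find: first match for a simple path expression
title = root.find("head/title").text

# findall: every match, here all <p> elements at any depth
paragraphs = [p.text for p in root.findall(".//p")]

# iter: walk the whole tree in document order
tags = [el.tag for el in root.iter()]

# xpath: full XPath 1.0, here returning text nodes directly
heading_texts = root.xpath("//h1/text()")

print(title, paragraphs, tags, heading_texts)
```

find and findall accept only the limited ElementPath syntax, while xpath supports full XPath expressions, so the latter is the one to reach for when a query needs predicates or text nodes.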
You can also use the standard Python XML interfaces, as Kurt pointed out:
>>> import lxml.sax
>>> from xml.dom.pulldom import SAX2DOM
>>> handler = SAX2DOM()
>>> lxml.sax.saxify(tree, handler)
>>> dom = handler.document
>>> print(dom.firstChild.localName)
But keep in mind that the lxml API is generally more convenient to work with than dom/minidom.