
DOMDocument interface for python lxml

I have written a small application that needs access to the DOM representation of the underlying HTML page. lxml is really great, but I have not been able to find such an interface in it. Does anyone know whether one exists, or whether another tool provides it?

Dave asked Oct 24 '11 13:10





2 Answers

According to the lxml documentation, you can parse the document with lxml and then use lxml.sax to replay the tree as SAX events into the standard library's xml.dom.pulldom module, which builds a DOM object. Based on the documentation, the code might look like:

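import lxml.sax
from xml.dom.pulldom import SAX2DOM

# `tree` is an lxml element or ElementTree obtained by parsing the page
handler = SAX2DOM()
lxml.sax.saxify(tree, handler)
dom = handler.document  # an xml.dom.minidom Document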
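
For a fuller picture, here is a minimal end-to-end sketch of the same idea, assuming Python 3 and an installed lxml. The sample markup and the variable names (`page`, `heading`) are made up for illustration and are not from the lxml documentation.

```python
from io import StringIO

import lxml.sax
from lxml import etree
from xml.dom.pulldom import SAX2DOM

# Hypothetical sample page, used only for this illustration.
page = "<html><body><h1 id='top'>page title</h1><p>text</p></body></html>"
tree = etree.parse(StringIO(page), etree.HTMLParser())

# Replay the lxml tree as SAX events into a DOM-building handler.
handler = SAX2DOM()
lxml.sax.saxify(tree, handler)
dom = handler.document  # minidom document built from the SAX events

# Standard DOM calls can now be used on the result.
heading = dom.getElementsByTagName("h1")[0]
print(dom.documentElement.tagName)
print(heading.getAttribute("id"))
print(heading.firstChild.data)
```

Note that this is a one-way conversion: the DOM is a copy, so changes made to it are not reflected in the lxml tree.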
Kurt McKee answered Sep 23 '22 15:09



There is an example of parsing HTML on the lxml site:

>>> from lxml import etree
>>> from io import StringIO

>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"

>>> parser = etree.HTMLParser()
>>> tree   = etree.parse(StringIO(broken_html), parser)

>>> result = etree.tostring(tree.getroot(), pretty_print=True,
...                         method="html", encoding="unicode")
>>> print(result)
<html>
  <head>
    <title>test</title>
  </head>
  <body>
    <h1>page title</h1>
  </body>
</html>

You can access tree elements with methods such as tree.find, tree.findall, tree.iter, and tree.xpath. For example:

>>> list(tree.getroot())
[<Element head at 0x4f4ad38>, <Element body at 0x4f4ad80>]

>>> tree.getroot().find('body')
<Element body at 0x4f4ad80>
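
As a rough sketch of how those four access methods differ, assuming Python 3 and an installed lxml, the snippet below runs each one against a small made-up document. The markup and variable names are hypothetical.

```python
from lxml import etree

# Hypothetical sample markup, chosen only for this illustration.
html = "<html><body><h1>Title</h1><p>one</p><p>two</p></body></html>"
root = etree.fromstring(html, etree.HTMLParser())

body = root.find("body")               # first matching direct child
paragraphs = body.findall("p")         # all matching direct children
tags = [el.tag for el in root.iter()]  # every element, in document order
texts = root.xpath("//p/text()")       # XPath query returning text nodes

print(body.tag, len(paragraphs), tags, texts)
```

find and findall accept the limited ElementPath syntax, while xpath accepts full XPath 1.0 expressions.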

You can also use the standard Python XML interfaces, as Kurt pointed out:

>>> import lxml.sax
>>> from xml.dom.pulldom import SAX2DOM
>>> handler = SAX2DOM()
>>> lxml.sax.saxify(tree, handler)

>>> dom = handler.document
>>> print(dom.firstChild.localName)

But keep in mind that lxml's own API is generally richer and more convenient than dom/minidom.

utapyngo answered Sep 22 '22 15:09
