Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Recommended way to generate XHTML documents with lxml

The Python library lxml appears to provide several builders for generating HTML documents. What's the difference between these?

But these generate plain HTML, rather than XHTML. While I could manually add in the xmlns declarations, that's inelegant. So what's the recommended way to generate XHTML documents with lxml?

lxml.builder.E

Example from http://lxml.de/tutorial.html#the-e-factory:

>>> from lxml.builder import E

>>> def CLASS(*args): # class is a reserved word in Python
...     return {"class":' '.join(args)}

>>> html = page = (
...   E.html(       # create an Element called "html"
...     E.head(
...       E.title("This is a sample document")
...     ),
...     E.body(
...       E.h1("Hello!", CLASS("title")),
...       E.p("This is a paragraph with ", E.b("bold"), " text in it!"),
...       E.p("This is another paragraph, with a", "\n      ",
...         E.a("link", href="http://www.python.org"), "."),
...       E.p("Here are some reserved characters: <spam&egg>."),
...       etree.XML("<p>And finally an embedded XHTML fragment.</p>"),
...     )
...   )
... )

lxml.html.builder

Example from http://lxml.de/lxmlhtml.html#creating-html-with-the-e-factory:

>>> from lxml.html import builder as E
>>> from lxml.html import usedoctest
>>> html = E.HTML(
...   E.HEAD(
...     E.LINK(rel="stylesheet", href="great.css", type="text/css"),
...     E.TITLE("Best Page Ever")
...   ),
...   E.BODY(
...     E.H1(E.CLASS("heading"), "Top News"),
...     E.P("World News only on this page", style="font-size: 200%"),
...     "Ah, and here's some more text, by the way.",
...     lxml.html.fromstring("<p>... and this is a parsed fragment ...</p>")
...   )
... )
like image 316
Mechanical snail Avatar asked Jun 03 '12 23:06

Mechanical snail


People also ask

Can lxml parse HTML?

lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).

Is lxml faster than BeautifulSoup?

lxml is way faster than BeautifulSoup - this may not matter if all you're waiting for is the network. But if you're parsing something on disk, this may be significant.

Why is lxml used?

lxml is a Python library which allows for easy handling of XML and HTML files, and can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML parsers. This is when the lxml library comes to play.


1 Answers

Mixing ElementMaker and E from lxml.builder does the trick for me:

from lxml import etree
from lxml.builder import ElementMaker,E

M=ElementMaker(namespace=None,
               nsmap={None: "http://www.w3.org/1999/xhtml"})
html = M.html(E.head(E.title("Test page")),
              E.body(E.p("Hello world")))
result = etree.tostring(html,
                        xml_declaration=True,
                        doctype='<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">',
                        encoding='utf-8',
                        standalone=False,
                        with_tail=False,
                        method='xml',
                        pretty_print=True)
print result

The result is

<?xml version='1.0' encoding='utf-8' standalone='no'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Test page</title>
  </head>
  <body>
    <p>Hello world</p>
  </body>
</html>
like image 74
Pedro A. Aranda Avatar answered Oct 26 '22 14:10

Pedro A. Aranda