The Python library lxml appears to provide several builders for generating HTML documents. What's the difference between these?
But these generate plain HTML, rather than XHTML. While I could manually add in the xmlns declarations, that's inelegant. So what's the recommended way to generate XHTML documents with lxml?
lxml.builder.E
Example from http://lxml.de/tutorial.html#the-e-factory:
>>> from lxml import etree  # needed for etree.XML() below
>>> from lxml.builder import E
>>> def CLASS(*args):  # class is a reserved word in Python
...     return {"class": ' '.join(args)}
>>> html = page = (
...   E.html(       # create an Element called "html"
...     E.head(
...       E.title("This is a sample document")
...     ),
...     E.body(
...       E.h1("Hello!", CLASS("title")),
...       E.p("This is a paragraph with ", E.b("bold"), " text in it!"),
...       E.p("This is another paragraph, with a", "\n ",
...         E.a("link", href="http://www.python.org"), "."),
...       E.p("Here are some reserved characters: <spam&egg>."),
...       etree.XML("<p>And finally an embedded XHTML fragment.</p>"),
...     )
...   )
... )
lxml.html.builder
Example from http://lxml.de/lxmlhtml.html#creating-html-with-the-e-factory:
>>> import lxml.html  # needed for lxml.html.fromstring() below
>>> from lxml.html import builder as E
>>> from lxml.html import usedoctest
>>> html = E.HTML(
...   E.HEAD(
...     E.LINK(rel="stylesheet", href="great.css", type="text/css"),
...     E.TITLE("Best Page Ever")
...   ),
...   E.BODY(
...     E.H1(E.CLASS("heading"), "Top News"),
...     E.P("World News only on this page", style="font-size: 200%"),
...     "Ah, and here's some more text, by the way.",
...     lxml.html.fromstring("<p>... and this is a parsed fragment ...</p>")
...   )
... )
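One difference between the two builders above can be checked directly. The following is a minimal sketch, not an exhaustive comparison; the aliases XE and HE are made up here for illustration. It suggests that lxml.builder.E produces generic XML elements for any tag name, while lxml.html.builder exposes predefined upper-case factories that produce lxml.html elements with the HTML-specific helper methods.

```python
import lxml.html
from lxml.builder import E as XE          # generic XML element factory
from lxml.html import builder as HE       # module of HTML-specific factories

# With lxml.builder.E, any attribute name becomes a tag name.
xml_p = XE.p("Hello")

# With lxml.html.builder, tags are predefined upper-case functions.
html_p = HE.P("Hello")

# Only the element from lxml.html.builder is an lxml.html.HtmlElement.
xml_is_html_element = isinstance(xml_p, lxml.html.HtmlElement)
html_is_html_element = isinstance(html_p, lxml.html.HtmlElement)

print(type(xml_p).__name__, xml_is_html_element)
print(type(html_p).__name__, html_is_html_element)

# HtmlElement adds HTML conveniences such as text_content().
print(html_p.text_content())
```

Neither builder adds the XHTML namespace on its own, which is the gap the question is about.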
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).
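As a rough illustration of the two parsing styles mentioned above, the sketch below parses the same small XML document once in a single step with etree.fromstring() and once incrementally with etree.iterparse(). The sample document and variable names are invented for this example.

```python
import io
from lxml import etree

# Hypothetical sample document, used only for illustration.
xml_bytes = b"<root><item>a</item><item>b</item></root>"

# One-step parsing: build the whole tree, then walk it.
root = etree.fromstring(xml_bytes)
one_step = [el.text for el in root.iter("item")]

# Step-by-step parsing: handle each <item> as soon as its end tag is seen.
step_by_step = []
for event, el in etree.iterparse(io.BytesIO(xml_bytes), events=("end",), tag="item"):
    step_by_step.append(el.text)
    el.clear()  # release the element's content once it has been processed

print(one_step)
print(step_by_step)
```

The incremental form is mainly useful for large inputs, where building the full tree first would cost too much memory.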
lxml is generally much faster than BeautifulSoup. This may not matter if your program spends most of its time waiting on the network, but it can be significant when you are parsing files on disk.
lxml is a Python library for handling XML and HTML files, and it can also be used for web scraping. Many off-the-shelf XML parsers exist, but developers who want more control sometimes prefer to build their own XML and HTML processing. This is where the lxml library comes into play.
Mixing ElementMaker and E from lxml.builder does the trick for me:
from lxml import etree
from lxml.builder import ElementMaker, E

# M declares XHTML as the default (unprefixed) namespace on the root element.
M = ElementMaker(namespace=None,
                 nsmap={None: "http://www.w3.org/1999/xhtml"})
html = M.html(E.head(E.title("Test page")),
              E.body(E.p("Hello world")))
result = etree.tostring(html,
                        xml_declaration=True,
                        doctype='<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">',
                        encoding='utf-8',
                        standalone=False,
                        with_tail=False,
                        method='xml',
                        pretty_print=True)
# tostring() returns bytes when an encoding is given, so decode before printing.
print(result.decode('utf-8'))
The result is
<?xml version='1.0' encoding='utf-8' standalone='no'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Test page</title>
  </head>
  <body>
    <p>Hello world</p>
  </body>
</html>
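One caveat with the approach above: because namespace=None is passed, the elements themselves carry no namespace in memory (html.tag is just "html"), and only the serialized root gets the xmlns declaration. If the tree needs to be namespace-aware before it is written out, for example for namespace-qualified XPath queries, a variant is to give ElementMaker the XHTML namespace and use the same maker for every element. This is a minimal sketch under that assumption; the names XHTML_NS and X are made up here.

```python
from lxml import etree
from lxml.builder import ElementMaker

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Every element built through X is placed in the XHTML namespace,
# which is also declared as the default (unprefixed) namespace.
X = ElementMaker(namespace=XHTML_NS, nsmap={None: XHTML_NS})

page = X.html(
    X.head(X.title("Test page")),
    X.body(X.p("Hello world")),
)

xhtml_bytes = etree.tostring(
    page,
    xml_declaration=True,
    encoding="utf-8",
    pretty_print=True,
)
print(xhtml_bytes.decode("utf-8"))
```

With this variant, page.tag is "{http://www.w3.org/1999/xhtml}html" rather than plain "html", so the in-memory tree matches what a parser would produce when reading the serialized document back.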