lxml offers a few different functions to parse strings. Two of them, `etree.fromstring()` and `etree.XML()`, seem very similar. The docstring for the former says it's for parsing "strings", while the latter is for "string constants". Additionally, `XML()`'s docstring states:

> This function can be used to embed "XML literals" in Python code, [...]

What's the functional difference between these functions? When should one be used over the other?
Looking at the source code for `XML()` and `fromstring()`, the former has this extra snippet of code:

```python
if parser is None:
    parser = __GLOBAL_PARSER_CONTEXT.getDefaultParser()
    if not isinstance(parser, XMLParser):
        parser = __DEFAULT_XML_PARSER
```
They thus differ in how they select the default parser: if the global default parser has been set to something that is not an `XMLParser`, `XML()` will ignore it and fall back to the default XML parser, whereas `fromstring()` will use whatever the global default is.
```python
from lxml import etree

etree.set_default_parser(etree.HTMLParser())

etree.tostring(etree.fromstring("<root/>"))
# b'<html><body><root/></body></html>'

etree.tostring(etree.XML("<root/>"))
# b'<root/>'
```
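The difference only matters when no parser is passed explicitly. As a minimal sketch (the `remove_blank_text` option is just a stand-in to show that a custom parser is being honoured), supplying the same `XMLParser` to both functions should make them behave identically:

```python
from lxml import etree

# An explicit parser overrides any default-parser logic in both functions.
parser = etree.XMLParser(remove_blank_text=True)

a = etree.fromstring("<root>  <child/>  </root>", parser)
b = etree.XML("<root>  <child/>  </root>", parser)

# Both parses drop the ignorable whitespace and produce the same tree.
assert etree.tostring(a) == etree.tostring(b)
print(etree.tostring(a))  # b'<root><child/></root>'
```

So in practice: if you might ever change the global default parser (e.g. to an `HTMLParser`) and still want guaranteed XML semantics, use `XML()`; otherwise the two are interchangeable.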