In our application we use Python's lxml to parse an XML string held in memory:
parser = etree.XMLParser(... huge_tree=False)
xml = etree.fromstring(src, parser)
I noticed that it bails out when the content of src is larger than 10 MB. This is the expected behaviour with huge_tree set to False.
What I can't find information on is: why 10 MB? The documentation says:
huge_tree - disable security restrictions and support very deep trees and very long text content (only affects libxml2 2.7+)
Also, libxml's changelog says:
include/libxml/parserInternals.h SAX2.c: add a new define XML_MAX_TEXT_LENGTH limiting the maximum size of a single text node, the default is 10MB and can be removed with the HUGE parsing option
However, I don't understand whether this value is hard-coded, or why it was chosen in the first place.
The reason I'm asking is that we're dealing with the occasional input larger than that (when there is a large binary attachment, for example) and perhaps it's possible to raise that limit to a more reasonable value, without disabling it completely.
lxml.etree supports parsing XML in a number of ways and from all important sources, namely strings, files, URLs (http/ftp) and file-like objects. The main parse functions are fromstring() and parse(), both called with the source as first argument.
To read an XML file using ElementTree, first import the xml.etree.ElementTree module from the standard library under the name ET (a common convention). Then pass the filename of the XML file to ET.parse() to parse it, and call getroot() on the result to get the root (top-level) element.
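The steps just described can be sketched as follows. This is a minimal illustration using only the standard library; the helper name read_root, the sample file name and the sample document are made up for the example.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def read_root(path):
    """Parse the XML file at `path` and return its root element."""
    tree = ET.parse(path)   # builds an ElementTree from the file
    return tree.getroot()   # the top-level element of the document


# Write a small, made-up sample document so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.xml"
    sample.write_text(
        "<catalog><item id='1'>first</item><item id='2'>second</item></catalog>",
        encoding="utf-8",
    )
    root = read_root(str(sample))
    item_ids = [item.get("id") for item in root.iter("item")]

print(root.tag)
print(item_ids)
```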
The 10000000 value is hard-coded in parserInternals.h of libxml. The limit was imposed shortly after the fix for CVE-2008-4226, which addressed an issue where extremely large text nodes could cause a memory overflow (by overflowing the amount of addressable memory).
The 10 MB value is arbitrary, which is why there's an option to override it. It seems intended to keep memory-overflow bugs in libxml from being exploited in the wild, by requiring the programmer to explicitly request that the parser be allowed to allocate as much memory as it can address (basically size_t) for a text node.
That doesn't quite answer why 10 MB, but it probably seemed large enough to deal with the case of programmers just throwing XML at the parser without thinking about whether or not to trust the source of the file.
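Since the limit itself can only be switched off and not raised, one possible workaround is to enforce your own cap before handing the input to a parser created with huge_tree=True. The following is only a sketch under that assumption: the name parse_with_cap and the 50 MB figure are made up for illustration, and it checks the size of the whole document, not of a single text node as libxml does. Note that, per the documentation quoted in the question, huge_tree also lifts the other security restrictions (such as the tree depth limit), so this is best kept to input you trust.

```python
from lxml import etree

# Hypothetical application-level cap; pick a value that suits your inputs.
MAX_XML_BYTES = 50 * 1024 * 1024


def parse_with_cap(src, max_bytes=MAX_XML_BYTES):
    """Parse `src` (bytes) with libxml's built-in limits disabled,
    after rejecting documents larger than `max_bytes`."""
    if len(src) > max_bytes:
        raise ValueError(
            "XML input is %d bytes, over the %d byte cap" % (len(src), max_bytes)
        )
    # huge_tree=True removes the hard-coded 10 MB text node limit
    # (and the other libxml safety limits along with it).
    parser = etree.XMLParser(huge_tree=True)
    return etree.fromstring(src, parser)
```

Usage would look like xml = parse_with_cap(src), with the ValueError caught wherever oversized input should be rejected.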