What is the maximum size of an XML file when using Python's lxml etree?

In our application we use Python's lxml to read an XML string in memory:

parser = etree.XMLParser(... huge_tree=False)
xml = etree.fromstring(src, parser)

I noticed that it bails out when the content of src is larger than 10 MB. This is the expected behaviour with huge_tree set to False.

What I can't find information on is: why 10 MB? The documentation says:

huge_tree - disable security restrictions and support very deep trees and very long text content (only affects libxml2 2.7+)

Also, libxml's changelog says:

include/libxml/parserInternals.h SAX2.c: add a new define XML_MAX_TEXT_LENGTH limiting the maximum size of a single text node, the default is 10MB and can be removed with the HUGE parsing option

However, I don't understand whether this is hard-coded, or why this value was chosen in the first place.

The reason I'm asking is that we occasionally deal with input larger than that (when there is a large binary attachment, for example). Perhaps it's possible to raise the limit to a more reasonable value without disabling it completely.

asked Nov 20 '15 14:11 by lorenzog



1 Answer

The value 10000000 is hard-coded as XML_MAX_TEXT_LENGTH in parserInternals.h of libxml2. The limit was introduced shortly after the fix for CVE-2008-4226, which addressed an integer overflow that extremely large text nodes could trigger, leading to memory corruption.
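
A minimal sketch for reproducing the limit yourself, assuming lxml is installed and built against libxml2 2.7 or later. The helper names make_doc and try_parse are made up for illustration, and the exact behaviour may differ between libxml2 versions. Note that the document has a single text node just over 10000000 bytes, since the limit applies per text node rather than to the file as a whole.

```python
def make_doc(text_bytes):
    """Build a tiny XML document whose single text node is text_bytes long."""
    return b"<root>" + b"x" * text_bytes + b"</root>"


def try_parse(src, huge_tree):
    """Return True if lxml parses src, False if it raises a syntax error."""
    # Imported lazily so make_doc() can be used without lxml installed.
    from lxml import etree

    parser = etree.XMLParser(huge_tree=huge_tree)
    try:
        etree.fromstring(src, parser)
        return True
    except etree.XMLSyntaxError:
        return False
```

Going by the behaviour described in the question, try_parse(make_doc(10_000_001), huge_tree=False) should fail, while the same call with huge_tree=True should succeed.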

The 10 MB value is arbitrary, which is why there's an option to override it. The limit seems intended to make memory-overflow bugs in libxml2 harder to exploit in the wild: the programmer has to explicitly request that the parser be allowed to allocate as much memory as it can address (basically up to size_t) for a text node.

That doesn't quite answer why 10 MB, but it probably seemed large enough for legitimate input while still protecting programmers who just throw XML at the parser without thinking about whether to trust the source of the file.
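
Since the limit is a compile-time constant, one possible workaround for getting a higher ceiling without removing it entirely is to turn on huge_tree and reject oversized input yourself before it reaches the parser. The sketch below assumes lxml is installed; the 50 MB figure and the names MAX_XML_BYTES, check_size and parse_capped are placeholders, not part of lxml or libxml2. It caps the whole document rather than a single text node, and it gives up the other protections of huge_tree=False, so it only makes sense to the extent that you trust the source.

```python
MAX_XML_BYTES = 50 * 1024 * 1024  # hypothetical cap; pick what fits your inputs


def check_size(src, max_bytes=MAX_XML_BYTES):
    """Raise ValueError if src (bytes or str) is larger than max_bytes."""
    size = len(src.encode("utf-8")) if isinstance(src, str) else len(src)
    if size > max_bytes:
        raise ValueError(f"XML input is {size} bytes, limit is {max_bytes}")
    return size


def parse_capped(src, max_bytes=MAX_XML_BYTES):
    """Parse src with huge_tree=True after enforcing our own size cap."""
    # Imported lazily so check_size() works without lxml installed.
    from lxml import etree

    check_size(src, max_bytes)
    parser = etree.XMLParser(huge_tree=True)
    return etree.fromstring(src, parser)
```

With this in place, xml = parse_capped(src) would replace the etree.fromstring(src, parser) call from the question, and oversized input surfaces as a ValueError before any parsing happens.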

answered Sep 21 '22 18:09 by ig0774