What is the maximum size of an XML file when using Python's lxml etree?

Tags: python, lxml, libxml2

In our application we use Python's lxml to read an XML string in memory:

from lxml import etree

parser = etree.XMLParser(... huge_tree=False)
xml = etree.fromstring(src, parser)

I noticed that parsing fails when the content of src is larger than 10 MB. This is the expected behaviour with huge_tree set to False.

What I can't find information on is: why 10 MB? The documentation says:

huge_tree - disable security restrictions and support very deep trees and very long text content (only affects libxml2 2.7+)

Also, libxml's changelog says:

include/libxml/parserInternals.h SAX2.c: add a new define XML_MAX_TEXT_LENGTH limiting the maximum size of a single text node, the default is 10MB and can be removed with the HUGE parsing option

However, I don't understand whether this is hard-coded, or why this value was chosen.

The reason I'm asking is that we're dealing with the occasional input larger than that (when there is a large binary attachment, for example) and perhaps it's possible to raise that limit to a more reasonable value, without disabling it completely.

asked Nov 20 '15 14:11

lorenzog

1 Answer

The value 10000000 is hard-coded in libxml2's parserInternals.h (the XML_MAX_TEXT_LENGTH define). The limit was introduced shortly after the fix for CVE-2008-4226, an integer overflow that extremely large text nodes could trigger, leading to memory corruption.

The 10 MB value is arbitrary, which is why there is an option to override it. The intent seems to be to make memory-overflow bugs in libxml2 harder to exploit in the wild: the programmer must explicitly ask the parser to allow a text node as large as memory permits (basically up to size_t).

That doesn't quite answer why 10 MB, but it probably seemed large enough to cover programmers who just throw XML at the parser without considering whether the source of the file can be trusted.
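
Since the limit is a compile-time define, lxml can only switch it off through huge_tree, not raise it to an intermediate value. One possible workaround is to enforce your own cap on the input size before handing it to a parser created with huge_tree=True. The following is a minimal sketch of that idea, not something lxml provides: parse_with_cap and MAX_XML_BYTES are hypothetical names, the 50 MB figure is an arbitrary assumption, and src is assumed to be bytes.

```python
from lxml import etree

# Hypothetical application-level cap; choose a value that suits your inputs.
MAX_XML_BYTES = 50 * 1024 * 1024


def parse_with_cap(src: bytes, max_bytes: int = MAX_XML_BYTES):
    """Parse src with libxml2's built-in limits lifted, after our own size check."""
    if len(src) > max_bytes:
        raise ValueError(
            f"XML input is {len(src)} bytes; the cap is {max_bytes} bytes"
        )
    # huge_tree=True removes the hard-coded 10 MB text-node limit (and the
    # depth limit), so the check above is the only size guard that remains.
    parser = etree.XMLParser(huge_tree=True)
    return etree.fromstring(src, parser)
```

Note that this only bounds the size of the raw input. It does not restore the other protections that huge_tree disables, so it is best reserved for sources you have some reason to trust.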

answered Sep 21 '22 18:09

ig0774
