How do I free up the memory used by an lxml.etree?

I'm loading data from a bunch of XML files with lxml.etree, but I'd like to close them once I'm done with this initial parsing. Currently the XML_FILES list in the code below takes up 350 MiB of the program's 400 MiB of used memory. I've tried del XML_FILES, del XML_FILES[:], XML_FILES = None, for etree in XML_FILES: etree = None, and a few more, but none of these seems to work. I also can't find anything in the lxml docs about closing an lxml file. Here's the code that does the parsing:

from lxml import etree

def open_xml_files():
    # `paths` (the list of XML file paths) is assumed to be defined elsewhere
    return [etree.parse(filename) for filename in paths]

def load_location_data(xml_files):
    location_data = {}

    for xml_file in xml_files:
        for city in xml_file.findall('City'):
            code = city.findtext('CityCode')
            name = city.findtext('CityName')
            # setdefault creates the 'city' sub-dict on first use (avoids a KeyError)
            location_data.setdefault('city', {})[code] = name

        # [A few more like the one above]    

    return location_data

XML_FILES = utils.open_xml_files()
LOCATION_DATA = load_location_data(XML_FILES)
# XML_FILES never used again from this point on

Now, how do I get rid of XML_FILES here?

Underyx asked Mar 13 '14 14:03


1 Answer

You might consider etree.iterparse, which hands you elements incrementally while the file is being parsed instead of returning a fully built tree. Combined with a generator expression in place of the list comprehension, this might save your program some memory.

def open_xml_files():
    return (etree.iterparse(filename) for filename in paths)

iterparse returns an iterator over (event, element) pairs from the file, while parse immediately parses the whole file and loads the resulting tree into memory. The difference in memory usage comes from the fact that iterparse doesn't parse anything until the iterator is advanced (in this case, implicitly via a for loop).
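For illustration, here is a minimal sketch of what looping over iterparse looks like. The sample document and the iter_cities helper are made up for this example; only the element names come from the question. As far as I can tell, the object returned by iterparse has no findall method, so the findall calls in load_location_data would need to become a loop along these lines.

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Fallback so the sketch still runs without lxml installed;
    # the standard library offers a similar iterparse function.
    import xml.etree.ElementTree as etree

# Hypothetical sample data, using the element names from the question.
SAMPLE = b"""<Locations>
  <City><CityCode>BUD</CityCode><CityName>Budapest</CityName></City>
  <City><CityCode>VIE</CityCode><CityName>Vienna</CityName></City>
</Locations>"""


def iter_cities(source):
    """Yield (code, name) for each City element once it is fully parsed."""
    # An "end" event fires when an element's closing tag has been read,
    # so its children are available at that point.
    for event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == "City":
            yield elem.findtext("CityCode"), elem.findtext("CityName")


cities = dict(iter_cities(io.BytesIO(SAMPLE)))
print(cities)
```

Nothing is read from the source until dict() starts pulling items out of the generator.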

EDIT: Apparently iterparse does work incrementally, but doesn't free memory as it parses. You could use the solution from this answer to free memory as you traverse the XML document.
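One commonly used pattern for freeing memory during iterparse is sketched below. It may or may not match the linked answer exactly. The sample document and the load_cities helper are hypothetical; the idea is to copy out what you need from each element, then clear it and drop the already-processed siblings that the parent still references.

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
    HAVE_LXML = True
except ImportError:
    # Fallback so the sketch still runs without lxml installed.
    import xml.etree.ElementTree as etree
    HAVE_LXML = False

# Hypothetical sample data, using the element names from the question.
SAMPLE = b"""<Locations>
  <City><CityCode>BUD</CityCode><CityName>Budapest</CityName></City>
  <City><CityCode>VIE</CityCode><CityName>Vienna</CityName></City>
</Locations>"""


def load_cities(source):
    """Build a {code: name} dict while discarding elements as they are processed."""
    cities = {}
    for event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag != "City":
            continue
        # Copy the plain strings out first; they outlive the element.
        cities[elem.findtext("CityCode")] = elem.findtext("CityName")
        # Drop the element's children, text, and attributes.
        elem.clear()
        if HAVE_LXML:
            # getprevious()/getparent() exist only in lxml. The parent still
            # holds the emptied siblings, so delete them from the front.
            while elem.getprevious() is not None:
                del elem.getparent()[0]
    return cities


result = load_cities(io.BytesIO(SAMPLE))
print(result)
```

Only the extracted strings are kept, so the tree should not grow with the size of the file.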

Emmett Butler answered Sep 28 '22 07:09