I'm loading data from a bunch of XML files with lxml.etree, but I'd like to close them once I'm done with this initial parsing. Currently the XML_FILES list in the code below takes up 350 MiB of the program's 400 MiB of used memory. I've tried del XML_FILES, del XML_FILES[:], XML_FILES = None, for etree in XML_FILES: etree = None, and a few more, but none of these seem to work. I also can't find anything in the lxml docs about closing an lxml file. Here's the code that does the parsing:

    from lxml import etree

    def open_xml_files():
        # Parses every file up front and keeps all trees in memory at once.
        return [etree.parse(filename) for filename in paths]

    def load_location_data(xml_files):
        # The nested dict must exist before it is indexed by city code.
        location_data = {'city': {}}
        for xml_file in xml_files:
            for city in xml_file.findall('City'):
                code = city.findtext('CityCode')
                name = city.findtext('CityName')
                location_data['city'][code] = name
            # [A few more like the one above]
        return location_data

    XML_FILES = utils.open_xml_files()
    LOCATION_DATA = load_location_data(XML_FILES)
    # XML_FILES never used again from this point on

Now, how do I get rid of XML_FILES here?
You might consider etree.iterparse, which uses a generator rather than an in-memory list. Combined with a generator expression, this might save your program some memory.

    def open_xml_files():
        return (etree.iterparse(filename) for filename in paths)

iterparse creates a generator over the parsed contents of the file, while parse immediately parses the file and loads the contents into memory. The difference in memory usage comes from the fact that iterparse doesn't actually do anything until its next() method is called (in this case, implicitly via a for loop).
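To make the laziness concrete, here is a minimal sketch showing that a generator expression does no parsing until it is iterated. It uses the standard library's xml.etree.ElementTree as a stand-in so it runs without lxml installed; lxml.etree.parse is called the same way. The parse_and_log helper and the in-memory sample documents are hypothetical, added only for illustration.

```python
import io
import xml.etree.ElementTree as ET  # stand-in; lxml.etree.parse is called the same way

parsed = []  # records each source at the moment it is actually parsed

def parse_and_log(source):
    # Hypothetical helper: note the source, then parse it.
    parsed.append(source)
    return ET.parse(source)

# Two tiny in-memory documents standing in for real files.
sources = [io.BytesIO(b"<a/>"), io.BytesIO(b"<b/>")]

# A generator expression: nothing is parsed when this line runs.
trees = (parse_and_log(source) for source in sources)
count_before = len(parsed)

# Only pulling an item from the generator triggers parsing, one source at a time.
first = next(trees)
count_after = len(parsed)
```

Consumed in a for loop, each tree becomes eligible for garbage collection once the loop moves on to the next one, provided nothing else keeps a reference to it.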
EDIT: Apparently iterparse does work incrementally, but it doesn't free memory as it parses. You could use the solution from this answer to free memory as you traverse the XML document.
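One common way to release memory while traversing is to clear each element once its data has been read. The sketch below is an assumption about what that might look like for the City records in the question, not necessarily the linked answer's exact approach. The function name load_city_names, the Locations root element, and the sample data are hypothetical. It falls back to the standard library when lxml is not installed and uses only calls that both libraries share.

```python
import io

try:
    from lxml import etree
except ImportError:  # the stdlib module offers the same iterparse/clear calls
    import xml.etree.ElementTree as etree

def load_city_names(source):
    """Map CityCode to CityName, clearing each City element after use."""
    cities = {}
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == "City":
            cities[elem.findtext("CityCode")] = elem.findtext("CityName")
            # Drop the element's children and text so they can be freed.
            elem.clear()
    return cities

# Hypothetical sample document shaped like the data in the question.
SAMPLE = (
    b"<Locations>"
    b"<City><CityCode>AMS</CityCode><CityName>Amsterdam</CityName></City>"
    b"<City><CityCode>RTM</CityCode><CityName>Rotterdam</CityName></City>"
    b"</Locations>"
)

sample_cities = load_city_names(io.BytesIO(SAMPLE))
```

The cleared City elements stay attached to the root as empty shells, so for very large files it may also be worth removing them from their parent; with this approach only the resulting dictionary needs to outlive the parsing step.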