I have to parse a 1Gb XML file with a structure such as below and extract the text within the tags "Author" and "Content":
<Database>
<BlogPost>
<Date>MM/DD/YY</Date>
<Author>Last Name, Name</Author>
<Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
</BlogPost>
<BlogPost>
<Date>MM/DD/YY</Date>
<Author>Last Name, Name</Author>
<Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
</BlogPost>
[...]
<BlogPost>
<Date>MM/DD/YY</Date>
<Author>Last Name, Name</Author>
<Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
</BlogPost>
</Database>
So far I've tried two things: i) reading the whole file and going through it with .find(xmltag) and ii) parsing the xml file with lxml and iterparse(). The first option I've got it to work, but it is very slow. The second option I haven't managed to get it off the ground.
Here's part of what I have:
for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
if element.tag == "BlogPost":
print element.text
else:
print 'Finished'
The result of that is only blank spaces, with no text in them.
I must be doing something wrong, but I can't grasp it. Also, In case it wasn't obvious enough, I am quite new to python and it is the first time I'm using lxml. Please, help!
for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
for child in element:
print(child.tag, child.text)
element.clear()
the final clear will stop you from using too much memory.
[update:] to get "everything between ... as a string" i guess you want one of:
for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
print(etree.tostring(element))
element.clear()
or
for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
print(''.join([etree.tostring(child) for child in element]))
element.clear()
or perhaps even:
for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
print(''.join([child.text for child in element]))
element.clear()
For future searchers: The top answer here suggests clearing the element on each iteration, but that still leaves you with an ever-increasing set of empty elements that will slowly build up in memory:
for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
for child in element:
print(child.tag, child.text)
element.clear()
^ This is not a scalable solution, especially as your source file gets larger and larger. The better solution is to get the root element, and clear that every time you load a complete record. This will keep memory usage pretty stable (sub-20MB I would say).
Here's a solution that doesn't require looking for a specific tag. This function will return a generator that yields all 1st child nodes (e.g. <BlogPost>
elements) underneath the root node (e.g. <Database>
). It does this by recording the start of the first tag after the root node, then waiting for the corresponding end tag, yielding the entire element, and then clearing the root node.
from lxml import etree
xmlfile = '/path/to/xml/file.xml'
def iterate_xml(xmlfile):
doc = etree.iterparse(xmlfile, events=('start', 'end'))
_, root = next(doc)
start_tag = None
for event, element in doc:
if event == 'start' and start_tag is None:
start_tag = element.tag
if event == 'end' and element.tag == start_tag:
yield element
start_tag = None
root.clear()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With