all,
I have a huge XML file and first need to check the value of the "status" tag under the root. Doing this uses about twice as much memory as processing with tag = item, and I have no idea why. I use lxml version 2.3.2 and Python 2.7.3 on Ubuntu 14.04. The structure of the XML is as follows:
<root>
  <status>s_value</status>
  <count>c_value</count>
  <items>
    <item>***</item>
    <item>***</item>
    ...
  </items>
</root>
I tried to process the file as follows (ignoring the namespace):
from lxml import etree
status = etree.iterparse('file.xml', tag='status')
for event, element in status:
    value = element.text
    element.clear()
del status
This code still uses a lot of memory and also takes a long time (15 s). I tried adding a "break": it gives the same result but is much faster (1 s). I cannot observe the memory usage in that case because it finishes so quickly.
from lxml import etree
status = etree.iterparse('file.xml', tag='status')
for event, element in status:
    value = element.text
    element.clear()
    break
del status
It seems something happens after the first status is processed, but since there is only one status element, I am wondering what is being processed. Does anyone have any idea what is happening? Thanks very much.
It seems something happens after the first status is processed
Yes. It is vainly searching for the second status.
Without the break, your loop must process the entire file. The loop searches for all of the <status> tags. Without reading the file to the end, it cannot know whether it has found the final one.
By contrast, with the break, the loop stops immediately.
Consider these two loops:
for i in range(1000000):
    if i == 1:
        print(i)

for i in range(1000000):
    if i == 1:
        print(i)
        break
Hopefully, you can see that the first loop must run one million times, even though it will find the one-and-only 1 immediately.
Similarly, your non-break loop must run over a bajillion lines, even though it will find the one-and-only <status> immediately.
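Applied to the XML case, the early exit can be wrapped in a small helper. This is only a sketch under some assumptions: the function name first_status and the tiny sample file are made up for illustration, the tag is checked by hand instead of using tag='status' so the same code also runs on the standard library parser when lxml is not installed, and namespaces are ignored as in the question. As far as I understand iterparse, the tag filter only limits which events are reported while the parser still builds the tree for everything it reads, which would also explain the memory growth when the loop runs to the end.

```python
import os
import tempfile

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: fall back to the standard library, whose iterparse
    # is compatible for this simple use.
    import xml.etree.ElementTree as etree


def first_status(path):
    """Return the text of the first <status> element, reading no further."""
    with open(path, "rb") as source:
        for _event, element in etree.iterparse(source, events=("end",)):
            if element.tag == "status":
                # Returning here plays the role of the break: the rest of
                # the file is never parsed.
                return element.text
    return None  # no <status> element in the file


# Tiny stand-in for the huge file (hypothetical content).
sample = (
    b"<root><status>s_value</status><count>2</count>"
    b"<items><item>a</item><item>b</item></items></root>"
)
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as handle:
    handle.write(sample)
    sample_path = handle.name

status_value = first_status(sample_path)
os.remove(sample_path)
print(status_value)
```

Because <status> sits near the top of the file, the parser only has to read the first chunk before the function returns, so both the run time and the memory use should stay small regardless of how many <item> elements follow.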