Python lxml eats a lot of memory when only one element exists

Hi all,

I have a huge XML file and first need to check the value of the "status" tag under the root. Doing this uses about twice as much memory as processing tag = item, and I have no idea why. I am using lxml 2.3.2 and Python 2.7.3 on Ubuntu 14.04. The structure of the XML is as follows:

<root>
<status>s_value</status>
<count>c_value</count>
<items>
<item>***</item>
<item>***</item>
...
</items>
</root>

I tried to process the file as follows (ignoring the namespace):

from lxml import etree
status = etree.iterparse('file.xml', tag='status')
for event, element in status:
    value = element.text
    element.clear()
del status

This code still uses a lot of memory and takes a long time (15 s). I tried adding a "break": it gives the same result but is much faster (1 s). I could not observe the memory usage in that case because it finishes so quickly.

from lxml import etree
status = etree.iterparse('file.xml', tag='status')
for event, element in status:
    value = element.text
    element.clear()
    break
del status

It seems something happens after running the first status, but since there is only one status element, I am wondering what is being processed. Does anyone have any idea what is happening? Thanks very much.

zhihong asked Nov 09 '22 12:11



1 Answer

It seems something happens after running the first status

Yes. It is vainly searching for the second status.

Without the break, your loop must process the entire file. The loop searches for all of the <status> tags. Without reading the file to the end, it cannot know if it has found the final tag.
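This may also explain the memory use. As far as I know, iterparse builds the tree for everything it parses, and a tag filter only limits which elements are handed to the loop. The <item> elements are therefore parsed and kept, but never cleared. Below is a minimal sketch under stated assumptions: SAMPLE is a made-up stand-in for the real file, the tag is checked by hand instead of with tag='status', and the standard library's ElementTree is used as a fallback if lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs.
    import xml.etree.ElementTree as etree

# Hypothetical miniature version of the file described in the question.
SAMPLE = (
    b"<root><status>s_value</status><count>3</count>"
    b"<items><item>a</item><item>b</item><item>c</item></items></root>"
)

root = None
status = None
for event, element in etree.iterparse(io.BytesIO(SAMPLE), events=("end",)):
    root = element  # the last "end" event belongs to the root element
    if element.tag == "status":
        status = element.text
        element.clear()  # only the <status> element is ever cleared

# Every <item> is still attached to the tree after the loop finishes.
retained = [item.text for item in root.iter("item")]
print(status, retained)
```

With a huge file, those retained items are what fill the memory. A loop that handles tag = item and clears each element as it goes releases at least their contents.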

With the break, by contrast, the loop stops as soon as the first <status> has been handled.

Consider these two loops:

for i in range(1000000):
    if i == 1:
        print(i)

for i in range(1000000):
    if i == 1:
        print(i)
        break

Hopefully, you can see that the first loop must run one million times, even though it will find the one-and-only 1 immediately.

Similarly with your code, your non-break loop must run over a bajillion lines, even though it will find the one-and-only <status> immediately.
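The same comparison can be sketched for the XML case by counting how many elements the loop receives with and without the break. This is only an illustration under assumptions: SAMPLE is a made-up miniature file, read_status is a hypothetical helper, the tag is checked by hand rather than with tag='status', and the standard library's ElementTree is used if lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs.
    import xml.etree.ElementTree as etree

# Hypothetical miniature version of the file described in the question.
SAMPLE = (
    b"<root><status>s_value</status><count>3</count>"
    b"<items><item>a</item><item>b</item><item>c</item></items></root>"
)

def read_status(source, stop_early):
    """Return (status text, number of "end" events the loop received)."""
    value = None
    seen = 0
    for event, element in etree.iterparse(source, events=("end",)):
        seen += 1
        if element.tag == "status" and value is None:
            value = element.text
            if stop_early:
                break  # stop asking the parser for more events
        element.clear()
    return value, seen

full = read_status(io.BytesIO(SAMPLE), stop_early=False)
early = read_status(io.BytesIO(SAMPLE), stop_early=True)
print(full)   # ('s_value', 7)
print(early)  # ('s_value', 1)
```

Both calls return the same status value. Only the one without the break goes on to receive every remaining element in the file.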

Robᵩ answered Nov 14 '22 22:11
