
Is there an elegant way to count tag elements in an XML file using lxml in Python?

I could read the content of the XML file into a string and use string operations to achieve this, but I guess there is a more elegant way. Since I did not find a clue in the docs, I am asking here:

Given an XML file (see below), how do you count XML tags, such as the number of author tags in the example below, in the most elegant way? We assume that each author appears exactly once.

<root>
    <author>Tim</author>
    <author>Eva</author>
    <author>Martin</author>
    etc.
</root>

This XML file is trivial, but it is possible that the authors are not always listed one after another; there may be other tags between them.

Aufwind asked Jun 26 '11 12:06


3 Answers

If you want to count all author tags:

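import lxml.etree
# xml is the path to the XML file (or an open file object)
doc = lxml.etree.parse(xml)
# XPath count() returns a float, e.g. 3.0
count = doc.xpath('count(//author)')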
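
As a self-contained sketch of the snippet above: the sample document is inlined here as a byte string (an assumption for illustration; the answer parses a file), with a `title` tag placed between the authors to mirror the case the question mentions. XPath `count()` returns a float in lxml, so the result is cast to `int` when a whole number is wanted.

```python
import lxml.etree

# Sample data assumed for illustration; it mirrors the question's XML,
# with a non-author tag between two of the authors.
XML = b"""<root>
    <author>Tim</author>
    <title>Some book</title>
    <author>Eva</author>
    <author>Martin</author>
</root>"""

root = lxml.etree.fromstring(XML)

# //author selects author elements anywhere in the document;
# XPath numbers are floats, so count() yields 3.0 rather than 3.
count = root.xpath('count(//author)')
author_count = int(count)
print(author_count)
```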

zeekay answered Oct 04 '22 04:10


Use an XPath expression with the count() function, e.g. count(//author).
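
If the same count is needed on several documents, one option is to compile the expression once with `lxml.etree.XPath` and call the resulting object on each tree. A minimal sketch; the name `count_authors` and the inline sample documents are assumptions for illustration:

```python
import lxml.etree

# Compile the XPath expression once; the result is callable on any tree.
count_authors = lxml.etree.XPath('count(//author)')

# Hypothetical sample documents; the second nests its author in another tag.
docs = [
    b"<root><author>Tim</author><author>Eva</author></root>",
    b"<root><book><author>Martin</author></book></root>",
]

# count() returns a float, so convert to int for a whole-number count.
counts = [int(count_authors(lxml.etree.fromstring(d))) for d in docs]
print(counts)
```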

Katriel answered Oct 04 '22 02:10


One must be careful when using the re module to process SGML/XML/HTML text, because not every treatment of such files can be performed with regexes (regexes cannot parse SGML/HTML/XML text in general).

But in this particular problem, it seems to me that it is possible (re.DOTALL is mandatory because an element may extend over more than one line; apart from that, I can't imagine any other possible pitfall).

from time import perf_counter
n = 10000
print('n ==', n, '\n')



import lxml.etree
doc = lxml.etree.parse('xml.txt')

# Time n XPath counts on the already parsed document
te = perf_counter()
for i in range(n):
    countlxml = doc.xpath('count(//author)')
tf = perf_counter()
print('lxml\ncount:', countlxml, '\n' + str(tf - te), 'seconds')



import re
with open('xml.txt') as f:
    ch = f.read()

# re.DOTALL lets .*? span elements that extend over several lines
regx = re.compile('<author>.*?</author>', re.DOTALL)
# Time n regex counts on the raw file text
te = perf_counter()
for i in range(n):
    countre = sum(1 for mat in regx.finditer(ch))
tf = perf_counter()
print('\nre\ncount:', countre, '\n' + str(tf - te), 'seconds')

result

n == 10000 

lxml
count: 3.0 
2.84083032899 seconds

re
count: 3 
0.141663256084 seconds
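
The regex in the timing script matches only the literal `<author>` opening tag, so the two counts agree only when every tag is written exactly that way. Below is a minimal sketch of one case where they diverge, assuming hypothetical data in which one tag carries an attribute. The loosened pattern is only an illustration and is still not a parser.

```python
import re
import lxml.etree

# Hypothetical variant of the question's data: one author tag has an attribute.
text = """<root>
    <author>Tim</author>
    <author id="2">Eva</author>
    <author>Martin</author>
</root>"""

# Pattern from the timing script: requires the opening tag to be exactly <author>.
literal = re.compile(r'<author>.*?</author>', re.DOTALL)
count_literal = sum(1 for _ in literal.finditer(text))

# Loosened pattern (an assumption, not from the answer): allows attributes.
loose = re.compile(r'<author\b[^>]*>.*?</author>', re.DOTALL)
count_loose = sum(1 for _ in loose.finditer(text))

# Parser-based count for comparison.
count_parsed = int(lxml.etree.fromstring(text).xpath('count(//author)'))

print(count_literal, count_loose, count_parsed)
```

Here the literal pattern finds 2 elements, while the loosened pattern and the XPath count both find 3.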

eyquem answered Oct 04 '22 02:10