I could read the content of the XML file into a string and use string operations to achieve this, but I guess there is a more elegant way. Since I did not find a clue in the docs, I am asking here:
Given an XML file (see below), what is the most elegant way to count XML tags, such as the number of author tags in the example below? We assume that each author appears exactly once.
<root>
<author>Tim</author>
<author>Eva</author>
<author>Martin</author>
etc.
</root>
This XML file is trivial, but it is possible that the authors are not always listed one after another; there may be other tags between them.
If you want to count all author tags:
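import lxml.etree
# xml is the path to the XML file (or an open file object)
doc = lxml.etree.parse(xml)
# XPath count() returns a float, e.g. 3.0
count = doc.xpath('count(//author)')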
This uses an XPath expression with the count() function.
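If lxml is not available, the same count can probably be obtained with the standard library. The following is only a minimal sketch, assuming Python 3 and xml.etree.ElementTree instead of lxml; the sample document and the variable names are made up for illustration, with other tags placed between the authors.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the author tags are not adjacent,
# and one of them is nested inside another element.
xml_text = """<root>
<author>Tim</author>
<title>Some book</title>
<author>Eva</author>
<editor><author>Martin</author></editor>
</root>"""

root = ET.fromstring(xml_text)

# iter() walks the whole tree, so nested author tags are counted too.
author_count = sum(1 for _ in root.iter('author'))

# ElementTree's limited XPath support offers an equivalent query.
findall_count = len(root.findall('.//author'))

print(author_count)
print(findall_count)
```

Unlike XPath count(), which returns a float, both variants here yield an int.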
One must be careful when using the re module to process SGML/XML/HTML text, because not every operation on such files can be performed with regexes (regexes cannot parse SGML/HTML/XML text in general).
But for this particular problem it seems to me to be possible (re.DOTALL is mandatory because an element may extend over more than one line; apart from that, I can't imagine any other possible pitfall):
# Python 2 benchmark: lxml XPath count vs. a compiled regex
from time import clock

n = 10000
print 'n ==',n,'\n'

import lxml.etree
doc = lxml.etree.parse('xml.txt')
te = clock()
for i in xrange(n):
    countlxml = doc.xpath('count(//author)')
tf = clock()
print 'lxml\ncount:',countlxml,'\n',tf-te,'seconds'

import re
with open('xml.txt') as f:
    ch = f.read()
# DOTALL lets '.' match newlines; '.*?' is non-greedy, so each element matches separately
regx = re.compile('<author>.*?</author>',re.DOTALL)
te = clock()
for i in xrange(n):
    countre = sum(1 for mat in regx.finditer(ch))
tf = clock()
print '\nre\ncount:',countre,'\n',tf-te,'seconds'
Result:
n == 10000
lxml
count: 3.0
2.84083032899 seconds
re
count: 3
0.141663256084 seconds
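The remark about re.DOTALL can be checked with a small experiment. This is just a sketch, assuming Python 3; the sample text and the variable names are invented for illustration, with one author element spread over two lines.

```python
import re

# Hypothetical sample data: the second author element spans two lines.
xml_text = """<root>
<author>Tim</author>
<author>Eva
Maria</author>
<title>Some book</title>
<author>Martin</author>
</root>"""

# Without DOTALL, '.' does not match a newline, so the multi-line element is missed.
pattern_plain = re.compile(r'<author>.*?</author>')
# With DOTALL, '.' also matches newlines; '.*?' stays non-greedy.
pattern_dotall = re.compile(r'<author>.*?</author>', re.DOTALL)

count_plain = len(pattern_plain.findall(xml_text))
count_dotall = len(pattern_dotall.findall(xml_text))

print(count_plain)
print(count_dotall)
```

Note that this pattern only matches a bare <author> opening tag; an element written with attributes, such as <author id="1">, would not be counted by it.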