With the lxml.etree Python library, is it more efficient to parse XML directly from a link to an online XML file, or is it better to use a different library (such as urllib2) to fetch the content as a string and then parse that? Or does it make no difference at all?
Method 1 - Parse directly from link
from lxml import etree as ET
parsed = ET.parse(url_link)
Method 2 - Parse from string
from lxml import etree as ET
import urllib2
xml_string = urllib2.urlopen(url_link).read()
parsed = ET.fromstring(xml_string)
# note: fromstring() is a function of lxml.etree itself
# (not of ET.parse) and returns the root element
Or is there a more efficient method than either of these, e.g. saving the XML to a .xml file on the desktop and then parsing from that?
I ran the two methods with a simple timing wrapper.
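The timing wrapper itself is not shown in this post. As a rough idea of what such a decorator could look like, here is a minimal sketch; the names timing, timings_ms and average_ms are my own assumptions, and it is written for Python 3 (time.perf_counter does not exist in Python 2).

```python
import functools
import time


def timing(func):
    """Hypothetical decorator: record and print each call's duration in ms."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        # Keep every measurement so an average can be computed afterwards.
        wrapper.timings_ms.append(elapsed_ms)
        print("%s took %.4f ms" % (func.__name__, elapsed_ms))
        return result

    wrapper.timings_ms = []
    return wrapper


def average_ms(timed_func):
    """Average duration in ms over all recorded calls of a @timing function."""
    return sum(timed_func.timings_ms) / len(timed_func.timings_ms)
```

After decorating a function with @timing and calling it in a loop, average_ms(that_function) would give a figure comparable to the "Average of 100" values reported here.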
Method 1 - Parse XML Directly From Link
from lxml import etree as ET
@timing
def parseXMLFromLink():
    parsed = ET.parse(url_link)
    print parsed.getroot()
for n in range(0,100):
    parseXMLFromLink()
Average of 100 = 98.4035 ms
Method 2 - Parse XML From String Returned By Urllib2
from lxml import etree as ET
import urllib2
@timing
def parseXMLFromString():
    xml_string = urllib2.urlopen(url_link).read()
    parsed = ET.fromstring(xml_string)
    print parsed
for n in range(0,100):
    parseXMLFromString()
Average of 100 = 286.9630 ms
So, anecdotally, parsing directly from the link with lxml is the faster method here. It's not clear whether it would be faster to download large XML documents first and then parse them from the hard drive, but unless the document is huge and the parsing task more intensive, the parseXMLFromLink() function would presumably remain quicker, since it is the urllib2 call that seems to slow the second function down.
I ran this a few times and the results stayed the same.
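To get a feel for the open question of parsing from the hard drive, the download can be taken out of the picture entirely. The sketch below is only an illustration under my own assumptions: it uses a synthetic document (make_sample_xml and time_ms are made-up helpers), it is written for Python 3, and it falls back to the standard library's xml.etree.ElementTree if lxml is not installed. It compares parsing from bytes already in memory with parsing from a local file, so it says nothing about network time.

```python
import os
import tempfile
import time

try:
    from lxml import etree as ET  # the library discussed in this thread
except ImportError:
    import xml.etree.ElementTree as ET  # stdlib fallback with the same calls


def make_sample_xml(n_items=1000):
    """Build a synthetic XML document (hypothetical test data) as bytes."""
    rows = "".join('<item id="%d">value %d</item>' % (i, i) for i in range(n_items))
    return ("<root>%s</root>" % rows).encode("utf-8")


def time_ms(func, repeats=20):
    """Average wall-clock time of func() in milliseconds over `repeats` runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    return (time.perf_counter() - start) * 1000.0 / repeats


xml_bytes = make_sample_xml()

# Write the same document to a temporary .xml file on disk.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(xml_bytes)
    xml_path = tmp.name

try:
    from_bytes_ms = time_ms(lambda: ET.fromstring(xml_bytes))
    from_file_ms = time_ms(lambda: ET.parse(xml_path).getroot())
    root_from_file = ET.parse(xml_path).getroot()
finally:
    os.remove(xml_path)

print("fromstring: %.4f ms, parse(file): %.4f ms" % (from_bytes_ms, from_file_ms))
```

The numbers will vary by machine and document size, so treat them as a sanity check rather than a benchmark. If both local parses turn out to be small next to the figures above, that would support the idea that the fetch, not the parse, dominates.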