I'm using lxml.etree.iterparse() to iterate through a large XML file. I would like to know how far I've got in parsing the input file, so that I can show a progress indicator.
My first idea was to use os.stat(filename).st_size to find out how big my XML file is and then, as I get events from the parser, retrieve the current position in the file. But I can't figure out how lxml.etree could give me access to its internal position. iterparse() takes a filename as its source argument, so I can't open the file myself and call its tell() method to know how many bytes have been read so far.
Are you aware of any built-in lxml.etree indicator of the parser's current progress? Or do you have an idea for how to integrate one?
You could pass a file object to iterparse() instead of a filename, and then call f.tell() on it. This will give you the approximate position of the Element in the file.
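    import os
    import lxml.etree as ET

    filename = 'data.xml'
    total_size = os.path.getsize(filename)  # file size in bytes

    # Binary mode, so that f.tell() counts bytes, the same unit as total_size
    with open(filename, 'rb') as f:
        context = ET.iterparse(f, events=('end', ), tag='Record')
        for event, elem in context:
            # f.tell() is how far the parser has read, not the exact offset of elem
            print(event, elem, float(f.tell())/total_size)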
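will yield something like the following (output from Python 2). Consecutive records show the same fraction because the parser reads the file in chunks, so f.tell() only advances when a new chunk is read: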
(u'end', <Element Record at 0xb743e2d4>, 0.09652665470688218)
(u'end', <Element Record at 0xb743e2fc>, 0.09652665470688218)
(u'end', <Element Record at 0xb743e324>, 0.09652665470688218)
...
(u'end', <Element Record at 0xb744739c>, 1.0)
(u'end', <Element Record at 0xb74473c4>, 1.0)
(u'end', <Element Record at 0xb74473ec>, 1.0)
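To turn this into a progress indicator, the same idea can be wrapped in a generator. This is only a minimal sketch, assuming Python 3: iter_with_progress and report_progress are hypothetical helper names, not part of lxml, and the step threshold is an arbitrary choice. The sketch filters on elem.tag by hand instead of using lxml's tag= keyword, and falls back to the standard library's xml.etree.ElementTree, purely so that it still runs where lxml is not installed.

```python
import os

try:
    import lxml.etree as ET  # the library discussed above
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as ET


def iter_with_progress(filename, tag):
    """Yield (element, fraction_read) for each completed `tag` element.

    Hypothetical helper, not part of lxml.
    """
    total_size = os.path.getsize(filename)
    # Binary mode: f.tell() then counts bytes, the same unit as total_size.
    with open(filename, 'rb') as f:
        for event, elem in ET.iterparse(f, events=('end',)):
            if elem.tag != tag:
                continue
            # The parser reads in chunks, so this is how far it has read,
            # not the exact offset of elem.
            yield elem, f.tell() / total_size
            # Release the element's content once the caller has moved on.
            elem.clear()


def report_progress(filename, tag, step=0.1):
    """Print a line each time progress grows by at least `step`.

    Returns the number of matching elements seen.
    """
    count = 0
    last_reported = 0.0
    for elem, fraction in iter_with_progress(filename, tag):
        count += 1
        if fraction - last_reported >= step:
            print('{:.0%} read, {} records so far'.format(fraction, count))
            last_reported = fraction
    return count
```

Calling report_progress('data.xml', 'Record') would then print a line roughly every 10% of the file. Because the position jumps a whole chunk at a time, the reported values are coarse for small files and get smoother as the file grows.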