Iterating XML with lxml in Python: how to know how much of the input file has been read?

Tags: python, xml, lxml

I'm using lxml.etree.iterparse() to iterate through a large XML file.

I would like to know how far I've got in the parsing of the input file, so that I might get a progress indicator.

My first idea was to use os.stat(filename).st_size to find out how big my XML file is and then, as I get events from the parser, retrieve the current position in the file. But I can't figure out how lxml.etree could give me access to its internal position. iterparse() takes a filename as its source argument, so I can't open the file myself and call its tell() method to know how many bytes have been read so far.

Are you aware of any built-in lxml.etree indicator of the parser's current progress? Or do you have an idea of how to add one?

Mickaël Le Baillif asked Jun 12 '13 17:06


1 Answer

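You could pass a file object to iterparse() instead of a filename, and then call f.tell() on it. This gives you only the approximate position of the element in the file: the parser reads the input in chunks, so f.tell() reports how far it has read ahead, not the exact offset of the element.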

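import lxml.etree as ET
import os

filename = 'data.xml'
total_size = os.path.getsize(filename)  # file size in bytes
# Binary mode, so that tell() returns a plain byte offset comparable to total_size.
with open(filename, 'rb') as f:
    context = ET.iterparse(f, events=('end',), tag='Record')
    for event, elem in context:
        # f.tell() is how far the parser has read; divide to get a fraction.
        print(event, elem, float(f.tell()) / total_size)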

This will yield something like the following (the tuple formatting is from Python 2):

(u'end', <Element Record at 0xb743e2d4>, 0.09652665470688218)
(u'end', <Element Record at 0xb743e2fc>, 0.09652665470688218)
(u'end', <Element Record at 0xb743e324>, 0.09652665470688218)
...
(u'end', <Element Record at 0xb744739c>, 1.0)
(u'end', <Element Record at 0xb74473c4>, 1.0)
(u'end', <Element Record at 0xb74473ec>, 1.0)
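If you would rather not depend on f.tell(), a variation on the same idea is to count the bytes yourself in a small wrapper that the parser reads through. This is only a sketch: ProgressFile and iter_records are hypothetical names, not part of lxml, and it assumes iterparse() accepts any object that has a read() method. The reported fraction is still chunk-granular, for the same reason as above.

```python
import os


class ProgressFile:
    """Hypothetical helper: wraps a binary file and counts bytes handed to the parser."""

    def __init__(self, fileobj, total_size):
        self._fileobj = fileobj
        self.total_size = total_size
        self.bytes_read = 0

    def read(self, size=-1):
        # Delegate to the real file and remember how much was returned.
        data = self._fileobj.read(size)
        self.bytes_read += len(data)
        return data

    @property
    def fraction(self):
        # Guard against an empty file so we never divide by zero.
        if self.total_size == 0:
            return 1.0
        return self.bytes_read / self.total_size


def iter_records(filename, tag='Record'):
    """Yield (elem, fraction_read) for each element with the given tag."""
    import lxml.etree as ET  # imported here so ProgressFile is usable without lxml

    total_size = os.path.getsize(filename)
    with open(filename, 'rb') as f:
        reader = ProgressFile(f, total_size)
        for event, elem in ET.iterparse(reader, events=('end',), tag=tag):
            yield elem, reader.fraction
            elem.clear()  # drop the element's content once the caller is done with it
```

You would then loop with something like `for elem, done in iter_records('data.xml'): ...` and feed `done` to whatever progress display you use. Calling elem.clear() is optional, but it keeps memory use down on large files.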
unutbu answered Sep 28 '22 20:09