
Is there a way to get a line number from an ElementTree Element?

I'm parsing some XML files using Python 3.2.1's cElementTree, and during parsing I noticed that some of the tags were missing attribute information. Is there an easy way to get the line numbers of those Elements in the XML file?

John Smith asked Aug 04 '11 22:08



2 Answers

Took a while for me to work out how to do this using Python 3.x (using 3.3.2 here) so thought I would summarize:

    import sys

    # Force the pure-Python XML parser rather than the faster C accelerator,
    # because we can't hook the C implementation.
    sys.modules['_elementtree'] = None
    import xml.etree.ElementTree as ET

    class LineNumberingParser(ET.XMLParser):
        # Note: _start_list and _end are private hooks of the pure-Python parser
        # as of 3.3; other Python versions may name them differently.
        def _start_list(self, *args, **kwargs):
            # Here we assume the default XML parser, which is expat,
            # and copy its element position attributes into output Elements.
            element = super()._start_list(*args, **kwargs)
            element._start_line_number = self.parser.CurrentLineNumber
            element._start_column_number = self.parser.CurrentColumnNumber
            element._start_byte_index = self.parser.CurrentByteIndex
            return element

        def _end(self, *args, **kwargs):
            element = super()._end(*args, **kwargs)
            element._end_line_number = self.parser.CurrentLineNumber
            element._end_column_number = self.parser.CurrentColumnNumber
            element._end_byte_index = self.parser.CurrentByteIndex
            return element

    # filename is the path of the XML file to parse
    tree = ET.parse(filename, parser=LineNumberingParser())
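If depending on private parser hooks is a concern, a different technique is to skip ElementTree for this check and use the public `xml.parsers.expat` API directly. The sketch below is not the approach from this answer: it builds no tree and only reports where start tags lacking a given attribute occur. The function name `find_elements_missing_attribute` and the sample document are made up for illustration.

```python
import xml.parsers.expat


def find_elements_missing_attribute(xml_text, required_attr):
    """Return (tag, line, column) for each start tag lacking required_attr."""
    parser = xml.parsers.expat.ParserCreate()
    missing = []

    def handle_start(tag, attrs):
        # While a handler runs, expat exposes the position of the current event.
        # attrs is a dict of attribute names to values.
        if required_attr not in attrs:
            missing.append(
                (tag, parser.CurrentLineNumber, parser.CurrentColumnNumber)
            )

    parser.StartElementHandler = handle_start
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return missing


# Hypothetical sample: the second <item> has no "id" attribute.
sample = '<root id="r">\n  <item id="1"/>\n  <item/>\n</root>\n'

for tag, line, column in find_elements_missing_attribute(sample, "id"):
    print(f"<{tag}> at line {line} is missing 'id'")
```

Line numbers reported by expat start at 1. The trade-off is that you get positions but no Element objects, so this suits a validation pass more than general tree processing.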
Duncan Harris answered Sep 18 '22 13:09


Looking at the docs, I see no way to do this with cElementTree.

However, I've had luck with lxml's implementation of the ElementTree API. It's supposed to be almost a drop-in replacement, built on libxml2, and its elements have a sourceline attribute. You also get a lot of other XML features.

The only caveat is that I've only used it in Python 2.x. I'm not sure how or if it works under 3.x, but it might be worth a look.

Addendum: their front page says:

The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API. The latest release works with all CPython versions from 2.3 to 3.2. See the introduction for more information about background and goals of the lxml project. Some common questions are answered in the FAQ.

So it looks like Python 3.x is OK.
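As a minimal sketch of the sourceline approach, assuming lxml is installed: the helper name `missing_attribute_lines` and the sample document are hypothetical, and only `etree.fromstring`, `iter()`, `attrib` and `sourceline` come from lxml itself.

```python
from lxml import etree

# Hypothetical sample: the second <item> has no "id" attribute.
xml_bytes = b'<root id="r">\n  <item id="1"/>\n  <item/>\n</root>\n'
root = etree.fromstring(xml_bytes)


def missing_attribute_lines(root, required_attr):
    """Return (tag, sourceline) for each element lacking required_attr."""
    return [
        (element.tag, element.sourceline)
        for element in root.iter()
        # Comments and processing instructions have a non-string tag; skip them.
        if isinstance(element.tag, str) and required_attr not in element.attrib
    ]


for tag, line in missing_attribute_lines(root, "id"):
    print(f"<{tag}> at line {line} is missing 'id'")
```

Here sourceline is filled in by the parser, so no subclassing or private hooks are needed.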

Michael Anderson answered Sep 18 '22 13:09