Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing broken XML with lxml.etree.iterparse

Tags:

I'm trying to parse a huge xml file with lxml in a memory efficient manner (ie streaming lazily from disk instead of loading the whole file in memory). Unfortunately, the file contains some bad ascii characters that break the default parser. The parser works if I set recover=True, but the iterparse method doesn't take the recover parameter or a custom parser object. Does anyone know how to use iterparse to parse broken xml?

#this works, but loads the whole file into memory
parser = lxml.etree.XMLParser(recover=True) #recovers from bad characters.
tree = lxml.etree.parse(filename, parser)

#how do I do the equivalent with iterparse?  (using iterparse so the file can be streamed lazily from disk)
context = lxml.etree.iterparse(filename, tag='RECORD')
#record contains 6 elements that I need to extract the text from

Thanks for your help!

EDIT -- Here is an example of the types of encoding errors I'm running into:

In [17]: data
Out[17]: '\t<articletext>&lt;p&gt;The cafeteria rang with excited voices.  Our barbershop quartet, The Bell \r Tones was asked to perform at the local Home for the Blind in the next town.  We, of course, were glad to entertain such a worthy group and immediately agreed .  One wag joked, "Which uniform should we wear?"  followed with, "Oh, that\'s right, they\'ll never notice."  The others didn\'t respond to this, in fact, one said that we should wear the nicest outfit we had.&lt;/p&gt;&lt;p&gt;A small stage was set up for us and a pretty decent P.A. system was donated for the occasion.  The audience was made up of blind persons of every age, from the thirties to the nineties.  Some sported sighted companions or nurses who stood or sat by their side, sharing the moment equally.  I observed several German shepherds lying at their feet, adoration showing in their eyes as they wondered what was going on.  After a short introduction in which we identified ourselves, stating our voice part and a little about our livelihood, we began our program.  Some songs were completely familiar and others, called "Oh, yeah" songs, only the chorus came to mind.  We didn\'t mind at all that some sang along \x1e they enjoyed it so much.&lt;/p&gt;&lt;p&gt;In fact, a popular part of our program is when the audience gets to sing some of the old favorites.  The harmony parts were quite evident as they tried their voices to the different parts.  I think there was more group singing in the old days than there is now, but to blind people, sound and music is more important.   We received a big hand at the finale and were made to promise to return the following year.  Everyone was treated to coffee and cake, our quartet going around to the different circles of friends to sing a favorite song up close and personal.  As we approached a new group, one blind lady amazed me by turning to me saying, "You\'re the baritone, aren\'t you?"  Previously no one had ever been able to tell which singer sang which part but this lady was listening with her whole heart.&lt;/p&gt;&lt;p&gt;Retired portrait photographer.  Main hobby - quartet singing.&lt;/p&gt;</articletext>\n'

In [18]: lxml.etree.from
lxml.etree.fromstring      lxml.etree.fromstringlist  

In [18]: lxml.etree.fromstring(data)
---------------------------------------------------------------------------
XMLSyntaxError                            Traceback (most recent call last)

/mnt/articles/<ipython console> in <module>()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088)()

XMLSyntaxError: PCDATA invalid Char value 30, line 1, column 1190

In [19]: chardet.detect(data)
Out[19]: {'confidence': 1.0, 'encoding': 'ascii'}

As you can see, chardet thinks it is an ascii file, but there is a "\x1e" right in the middle of this example which is making lxml raise an exception.

like image 999
erikcw Avatar asked Feb 28 '10 21:02

erikcw


People also ask

How do you parse lxml?

Since lxml 2.0, the parsers have a feed parser interface that is compatible to the ElementTree parsers. You can use it to feed data into the parser in a controlled step-by-step way. In lxml. etree, you can use both interfaces to a parser at the same time: the parse() or XML() functions, and the feed parser interface.

Is XML and lxml are same?

lxml is a Python library which allows for easy handling of XML and HTML files, and can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML parsers.

What is lxml in BeautifulSoup?

To prevent users from having to choose their parser library in advance, lxml can interface to the parsing capabilities of BeautifulSoup through the lxml. html. soupparser module. It provides three main functions: fromstring() and parse() to parse a string or file using BeautifulSoup into an lxml.

What is the difference between HTML parser and lxml?

lxml is also a similar parser but driven by XML features than HTML. It has dependency on external C libraries. It is faster as compared to html5lib. Lets observe the difference in behavior of these two parsers by taking a sample tag example and see the output.


2 Answers

Edit:

This is an older answer and I would have done it differently today. And I'm not just referring to the dumb snark ... since then BeutifulSoup4 is available and it's really quite nice. I recommend that to anyone who stumbles over here.


The currently accepted answer is, well, not what one should do. The question itself also has a bad assumption:

parser = lxml.etree.XMLParser(recover=True) #recovers from bad characters.

Actually recover=True is for recovering from misformed XML. There is however an "encoding" option which would have fixed your issue.

parser = lxml.etree.XMLParser(encoding='utf-8' #Your encoding issue.
                              recover=True, #I assume you probably still want to recover from bad xml, it's quite nice. If not, remove.
                              )

That's it, that's the solution.


BTW -- For anyone struggling with parsing XML in python, especially from third party sources. I know, I know, the documentation is bad and there are a lot of SO red herrings; a lot of bad advice.

  • lxml.etree.fromstring()? - That's for perfectly formed XML, silly
  • BeautifulStoneSoup? - Slow, and has a way-stupid policy for self closing tags
  • lxml.etree.HTMLParser()? - (because the xml is broken) Here's a secret - HTMLParser() is... a Parser with recover=True
  • lxml.html.soupparser? - The encoding detection is supposed to be better, but it has the same failings of BeautifulSoup for self closing tags. Perhaps you can combine XMLParser with BeautifulSoup's UnicodeDammit
  • UnicodeDammit and other cockamamie stuff to fix encodings? - Well, UnicodeDammit is kind of cute, I like the name and it's useful for stuff beyond xml, but things are usually fixed if you do the right thing with XMLParser()

You could be trying all sorts of stuff from what's available online. lxml documentation could be better. The code above is what you need for 90% of your XML parsing cases. Here I'll restate it:

magical_parser = XMLParser(encoding='utf-8', recover=True)
tree = etree.parse(StringIO(your_xml_string), magical_parser) #or pass in an open file object

You're welcome. My headaches == your sanity. Plus it has other features you might need for, you know, XML.

like image 190
Purrell Avatar answered Oct 14 '22 04:10

Purrell


I solved the problem by creating a class with a File like object interface. The class' read() method reads a line from the file and replaces any "bad characters" before returning the line to iterparse.

#psudo code

class myFile(object):
    def __init__(self, filename):
        self.f = open(filename)

    def read(self, size=None):
        return self.f.next().replace('\x1e', '').replace('some other bad character...' ,'')


#iterparse
context = lxml.etree.iterparse(myFile('bigfile.xml', tag='RECORD')

I had to edit the myFile class a few times adding some more replace() calls for a few other characters that were making lxml choke. I think lxml's SAX parsing would have worked as well (seems to support the recover option), but this solution worked like a charm!

like image 34
erikcw Avatar answered Oct 14 '22 06:10

erikcw