
Is there a way to force lxml to parse Unicode strings that specify an encoding in a tag?

Tags: python, lxml

I have an XML file that declares an encoding, and I use UnicodeDammit to convert it to unicode (for storage reasons, I can't keep it as a byte string). Later I pass it to lxml, but lxml refuses to ignore the encoding declared in the file and parse the text as Unicode; it raises an exception instead.

How can I force lxml to parse the document? This behaviour seems too restrictive.

Stavros Korokithakis asked Aug 04 '10 04:08



1 Answer

You cannot parse a unicode string that also carries an encoding declaration. So either you turn it back into an encoded string (since you apparently can't store it as a byte string, you will have to re-encode it before parsing), or you serialize the tree as unicode with lxml yourself, using etree.tostring(tree, encoding=unicode), which leaves out the XML declaration. You can then parse the result again with etree.fromstring, which accepts unicode input that has no encoding declaration. (On Python 3, the equivalent spelling is encoding='unicode'.)

see http://lxml.de/parsing.html#python-unicode-strings
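The second option can be sketched as follows. This is a minimal illustration in Python 3 spelling (encoding="unicode" rather than encoding=unicode); the sample document and the variable names are made up for the example.

```python
from lxml import etree

# Hypothetical sample: bytes input with an encoding declaration parses fine.
raw = '<?xml version="1.0" encoding="iso-8859-1"?><root>caf\xe9</root>'.encode("iso-8859-1")
tree = etree.fromstring(raw)

# encoding="unicode" asks lxml for a str and omits the XML declaration.
text = etree.tostring(tree, encoding="unicode")

# A str without a declaration can be handed straight back to the parser.
reparsed = etree.fromstring(text)
```

The str in text is what you would store; since it carries no declaration, nothing in it conflicts with its actual (already decoded) form when it is parsed again.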

Edit: If you already have the unicode string and can't control how it was produced, you'll have to encode it again and tell the parser which encoding you used:

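from lxml import etree

# Parser that always decodes input as UTF-8, whatever the document declares.
utf8_parser = etree.XMLParser(encoding='utf-8')

def parse_from_unicode(unicode_str):
    s = unicode_str.encode('utf-8')  # re-encode to match the parser
    return etree.fromstring(s, parser=utf8_parser)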

This makes sure that whatever is inside the XML declaration gets ignored, because the parser will always use utf-8.
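As a rough check of that claim, the sketch below uses a made-up document whose declaration says iso-8859-1 even though the text is already a decoded str. The sample string and the rejected flag are assumptions for illustration; the direct parse is expected to be refused with a ValueError, though the exact behaviour may depend on the lxml version.

```python
from lxml import etree

# Hypothetical input: a decoded str that still carries an encoding declaration.
doc = '<?xml version="1.0" encoding="iso-8859-1"?><root>caf\xe9</root>'

# Parsing the str directly is expected to be rejected because of the declaration.
try:
    etree.fromstring(doc)
    rejected = False
except ValueError:
    rejected = True

# Re-encode as UTF-8 and force the parser to use UTF-8, ignoring the declaration.
utf8_parser = etree.XMLParser(encoding="utf-8")
root = etree.fromstring(doc.encode("utf-8"), parser=utf8_parser)
```

If the override works as described, root.text holds the original text intact rather than a mis-decoded version of the UTF-8 bytes.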

Steven answered Oct 16 '22 04:10