Is there a way to force lxml to parse Unicode strings that specify an encoding in a tag?

Tags: python, lxml

I have an XML file that specifies an encoding, and I use UnicodeDammit to convert it to Unicode (for storage reasons, I can't store it as a byte string). I later pass it to lxml, but lxml refuses to ignore the encoding declared in the file and parse it as Unicode; it raises an exception instead.

How can I force lxml to parse the document? This behaviour seems too restrictive.

272

asked Aug 04 '10 04:08

Stavros Korokithakis

1 Answer

You cannot parse from a unicode string AND have an encoding declaration in that string. So either you make it an encoded string (since you apparently can't store it as a string, you will have to re-encode it before parsing), or you serialize the tree as unicode with lxml yourself: etree.tostring(tree, encoding=unicode) (encoding='unicode' on Python 3), WITHOUT an XML declaration. You can easily parse the result again with etree.fromstring, which accepts unicode strings that carry no encoding declaration.

see http://lxml.de/parsing.html#python-unicode-strings
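As a rough illustration of both points, here is a minimal sketch using Python 3 spelling (str instead of unicode). The sample document and the variable names are made up for the example; only etree.fromstring and etree.tostring come from lxml.

```python
from lxml import etree

# Hypothetical sample: a text (str) document that still carries an
# encoding declaration, as UnicodeDammit output typically would.
declared = '<?xml version="1.0" encoding="iso-8859-1"?><root>caf\u00e9</root>'

try:
    etree.fromstring(declared)
    rejected = False
except ValueError:
    # lxml refuses text input that contains an encoding declaration.
    rejected = True

# The same document parses fine as bytes, because the bytes now really
# are in the declared encoding.
tree = etree.fromstring(declared.encode('iso-8859-1'))

# Serialize to text; lxml emits no XML declaration for text output.
# (On Python 2 this would be encoding=unicode.)
text = etree.tostring(tree, encoding='unicode')

# Declaration-free text can be parsed again directly.
reparsed = etree.fromstring(text)
```

Storing the declaration-free text is what lets the later parse go through without any re-encoding step.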

Edit: If you already have the unicode string and can't control how it was made, you'll have to encode it again and give the parser the encoding you used:

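from lxml import etree

# Parser that always decodes input as UTF-8, overriding any declared encoding.
utf8_parser = etree.XMLParser(encoding='utf-8')

def parse_from_unicode(unicode_str):
    # Re-encode the text so lxml receives bytes rather than a unicode string.
    s = unicode_str.encode('utf-8')
    return etree.fromstring(s, parser=utf8_parser)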

This makes sure that whatever is inside the XML declaration gets ignored, because the parser will always use utf-8.
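A quick way to check this is to feed the helper a string whose declaration names a different encoding. The sketch below repeats the helper so it runs on its own; the sample document is invented, and it deliberately contains a euro sign, which iso-8859-1 cannot represent.

```python
from lxml import etree

utf8_parser = etree.XMLParser(encoding='utf-8')

def parse_from_unicode(unicode_str):
    # Same helper as above: re-encode, then parse with the UTF-8 parser.
    s = unicode_str.encode('utf-8')
    return etree.fromstring(s, parser=utf8_parser)

# Hypothetical input: the declaration claims iso-8859-1, but the text
# is handed over as a str and re-encoded as UTF-8 by the helper.
doc = '<?xml version="1.0" encoding="iso-8859-1"?><root>caf\u00e9 \u20ac</root>'
root = parse_from_unicode(doc)
content = root.text
```

If the declared encoding were honoured, the UTF-8 bytes would be misread as iso-8859-1 and the text would come back garbled; with the parser's encoding override, content should match the original characters.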

164

answered Oct 16 '22 04:10

Steven
