Python ElementTree won't convert non-breaking spaces when using UTF-8 for output

Tags: python, xml, encoding, elementtree

I'm trying to parse, manipulate, and output HTML using Python's ElementTree:

import sys
from cStringIO  import StringIO
from xml.etree  import ElementTree as ET
from htmlentitydefs import entitydefs

source = StringIO("""<html>
<body>
<p>Less than &lt;</p>
<p>Non-breaking space &nbsp;</p>
</body>
</html>""")

parser = ET.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity.update(entitydefs)
etree = ET.ElementTree()

tree = etree.parse(source, parser=parser)
for p in tree.findall('.//p'):
    print ET.tostring(p, encoding='UTF-8')

When I run this using Python 2.7 on Mac OS X 10.6, I get:

<p>Less than &lt;</p>

Traceback (most recent call last):
  File "bar.py", line 20, in <module>
    print ET.tostring(p, encoding='utf-8')
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1120, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 815, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 931, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1067, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 19: ordinal not in range(128)

I thought that specifying "encoding='UTF-8'" would take care of the non-breaking space character, but apparently it doesn't. What should I do instead?

asked May 18 '12 13:05

Greg Wilson

1 Answer

The byte 0xA0 is the Latin-1 encoding of the non-breaking space (U+00A0), and the value of p.text in the loop is a str (a byte string), not a unicode object. To encode a str as UTF-8, Python 2 must first implicitly convert it to unicode (i.e. call decode on it). Since nothing tells it which encoding the bytes are in, it assumes ASCII. 0xA0 is not a valid ASCII byte, although it is valid Latin-1, so the implicit decode raises the UnicodeDecodeError in your traceback.
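To make the decode step visible, here is a minimal sketch. It assumes Python 3, where bytes and text are separate types, so the decode that Python 2 performs implicitly has to be written out by hand. The names raw and try_decode are made up for illustration.

```python
# Byte string shaped like the p.text value that entitydefs produced.
raw = b"Non-breaking space \xa0"

def try_decode(data, codec):
    """Return the decoded text, or None if the bytes are invalid for codec."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None

ascii_text = try_decode(raw, "ascii")      # fails: 0xA0 is outside ASCII
latin1_text = try_decode(raw, "latin-1")   # works: 0xA0 maps to U+00A0
utf8_bytes = latin1_text.encode("utf-8")   # U+00A0 takes two bytes in UTF-8

print(ascii_text, repr(latin1_text), utf8_bytes)
```

The 0xA0 byte sits at index 19 of raw, which matches the "position 19" reported in the traceback.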

The reason you have Latin-1 bytes instead of unicode characters is that entitydefs maps entity names to Latin-1 encoded strings. You need the Unicode code point, which you can get from htmlentitydefs.name2codepoint.

The version below should fix it for you:

import sys
from cStringIO  import StringIO
from xml.etree  import ElementTree as ET
from htmlentitydefs import name2codepoint

source = StringIO("""<html>
<body>
<p>Less than &lt;</p>
<p>Non-breaking space &nbsp;</p>
</body>
</html>""")

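parser = ET.XMLParser()
# Make expat behave as if the document had an external DTD; without this,
# undefined entities such as &nbsp; are a parse error.
parser.parser.UseForeignDTD(True)
# Map each entity name to a unicode character, not a Latin-1 byte string.
parser.entity.update((x, unichr(i)) for x, i in name2codepoint.iteritems())
etree = ET.ElementTree()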

tree = etree.parse(source, parser=parser)
for p in tree.findall('.//p'):
    print ET.tostring(p, encoding='UTF-8')
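If you want to check the mapping on its own, the following is a rough sketch rather than a port of the code above. It assumes Python 3, where htmlentitydefs lives at html.entities and unichr is spelled chr, and it builds the paragraph by hand instead of parsing it, so only the serialization step is exercised. The names entities, p and out are chosen for illustration.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint  # Python 3 location of htmlentitydefs

# Same idea as the parser.entity.update(...) line: entity name -> text character.
entities = {name: chr(cp) for name, cp in name2codepoint.items()}

# Build the element directly so that only serialization is being tested.
p = ET.Element("p")
p.text = "Non-breaking space " + entities["nbsp"]

# With a text (unicode) value, encoding to UTF-8 succeeds and returns bytes.
out = ET.tostring(p, encoding="UTF-8")
print(out)
```

The serialized bytes contain the two-byte UTF-8 sequence C2 A0 for the non-breaking space, which is what the original code was expected to emit.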

answered Oct 04 '22 02:10

lambacck
