I'm trying to parse, manipulate, and output HTML using Python's ElementTree:
import sys
from cStringIO import StringIO
from xml.etree import ElementTree as ET
from htmlentitydefs import entitydefs
source = StringIO("""<html>
<body>
<p>Less than &lt;</p>
<p>Non-breaking space &nbsp;</p>
</body>
</html>""")
parser = ET.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity.update(entitydefs)
etree = ET.ElementTree()
tree = etree.parse(source, parser=parser)
for p in tree.findall('.//p'):
    print ET.tostring(p, encoding='UTF-8')
When I run this using Python 2.7 on Mac OS X 10.6, I get:
<p>Less than &lt;</p>
Traceback (most recent call last):
  File "bar.py", line 20, in <module>
    print ET.tostring(p, encoding='utf-8')
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1120, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 815, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 931, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1067, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 19: ordinal not in range(128)
I thought that specifying "encoding='UTF-8'" would take care of the non-breaking space character, but apparently it doesn't. What should I do instead?
Bytes that are not valid UTF-8 can otherwise end up in the text as invisible characters. There are a few ways to set the encoding to UTF-8 (in Python 3, UTF-8 is already the default source encoding): set it from within Python, which only affects the current session, or set the locale environment variables in /etc/default/locale.
In Python 2, the default encoding is ASCII (unfortunately). UTF-16 is a variable-width encoding that uses 2 or 4 bytes per character. It suits Asian text well, because most of those characters fit in 2 bytes.
A good practice is to decode bytes as UTF-8 (or whichever encoding was used to create them) as soon as they are loaded from a file, run all processing on Unicode strings, and encode back to bytes as UTF-8 only when writing to a file at the end.
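The decode-early, encode-late pattern can be sketched as follows. This is a minimal illustration written for Python 3 rather than the Python 2 used in the question, and the in-memory `io.BytesIO` objects are stand-ins (an assumption for the example) for real files.

```python
import io

# Hypothetical input: UTF-8 bytes as they would arrive from a file.
raw = io.BytesIO("Non-breaking\u00a0space".encode("utf-8")).read()

# 1. Decode at the boundary, as soon as the bytes are loaded.
text = raw.decode("utf-8")

# 2. Do all processing on the Unicode string, never on raw bytes.
cleaned = text.replace("\u00a0", " ")

# 3. Encode back to bytes only when writing the result out.
out = io.BytesIO()
out.write(cleaned.encode("utf-8"))
result = out.getvalue()
```

Keeping bytes at the edges and Unicode in the middle means no step ever has to guess which encoding a byte string is in.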
A string of ASCII text is also valid UTF-8 text. UTF-8 is a byte-oriented encoding: each character is represented by a specific sequence of one or more bytes.
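The byte counts mentioned above can be checked directly. The snippet below is a small Python 3 sketch; the sample characters are chosen for illustration and are not from the original post.

```python
# One character from each UTF-8 width class: ASCII letter,
# non-breaking space, euro sign, and an emoji outside the BMP.
samples = ["A", "\u00a0", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies in UTF-8.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# Number of bytes in UTF-16 (little-endian, so no byte-order mark is added).
utf16_lengths = [len(ch.encode("utf-16-le")) for ch in samples]

# Pure ASCII text produces identical bytes in ASCII and UTF-8.
ascii_text = "Less than <"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
```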
0xA0 is the Latin-1 byte for the non-breaking space, not a Unicode character, and the value of p.text in the loop is a str rather than a unicode object. To encode it as UTF-8, Python must first implicitly convert it to a unicode string (i.e. call decode). For that step it assumes ASCII, since it wasn't told anything else. 0xA0 is not a valid ASCII byte, but it is a valid Latin-1 byte.
You have Latin-1 bytes instead of Unicode characters because entitydefs is a mapping of names to Latin-1 encoded strings. You need the Unicode code point instead, which you can get from htmlentitydefs.name2codepoint.
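The failing step can be reproduced explicitly. Python 3 no longer decodes implicitly, so this sketch (an illustration, not the library's internal code) performs by hand the ASCII decode that Python 2 attempts behind the scenes. The byte string is an assumed stand-in for the value of p.text.

```python
# Assumed stand-in for p.text: a byte string ending in the Latin-1 NBSP byte.
raw = b"Non-breaking space \xa0"

# Decoding as ASCII fails, just like the implicit decode in the traceback.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError as exc:
    ascii_ok = False
    failed_position = exc.start  # index of the offending byte

# Decoding as Latin-1 succeeds and yields U+00A0, which UTF-8 can encode.
as_latin1 = raw.decode("latin-1")
utf8_bytes = as_latin1.encode("utf-8")
```

The offending byte sits at index 19, the same position reported in the traceback.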
The version below should fix it for you:
import sys
from cStringIO import StringIO
from xml.etree import ElementTree as ET
from htmlentitydefs import name2codepoint
source = StringIO("""<html>
<body>
<p>Less than &lt;</p>
<p>Non-breaking space &nbsp;</p>
</body>
</html>""")
parser = ET.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity.update((x, unichr(i)) for x, i in name2codepoint.iteritems())
etree = ET.ElementTree()
tree = etree.parse(source, parser=parser)
for p in tree.findall('.//p'):
    print ET.tostring(p, encoding='UTF-8')
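For readers on Python 3, the entity mapping itself can be built the same way. This is only a sketch of that one step, assuming Python 3, where htmlentitydefs is named html.entities, unichr is chr, and iteritems is items. It does not reproduce the parser setup above, and `entity_map` is a name introduced here for illustration.

```python
from html.entities import name2codepoint

# Map each HTML entity name to the Unicode character it stands for,
# mirroring the generator passed to parser.entity.update() above.
entity_map = {name: chr(codepoint) for name, codepoint in name2codepoint.items()}

# Example lookup: the non-breaking space as a real Unicode character.
nbsp = entity_map["nbsp"]
```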