I have to read an XML file in Python and grab various things, and I ran into a frustrating UnicodeEncodeError that I couldn't figure out even with googling.
Here are snippets of my code:
#!/usr/bin/python
# coding: utf-8
from xml.dom.minidom import parseString
with open('data.txt','w') as fout:
    #do a lot of stuff
    nameObj = data.getElementsByTagName('name')[0]
    name = nameObj.childNodes[0].nodeValue
    #... do more stuff
    fout.write(','.join((name,bunch of other stuff)))
This spectacularly crashes when a name entry I am parsing contains a Euro sign. Here is the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 60: ordinal not in range(128)
I understand why the Euro sign will screw it up (because it's at 128, right?), but I thought doing # coding: utf-8 would fix that. I also tried adding .encode(utf-8) so that the name looks instead like
name = nameObj.childNodes[0].nodeValue.encode(utf-8)
But that doesn't work either. What am I doing wrong? (I am using Python 2.7.3 if anyone wants to know)
EDIT: Python crashes out on the fout.write() line -- it will go through fine where the name field is like:
<name>United States, USD</name>
But will crap out on name fields like:
<name>France, € </name>
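When you open a file in Python 2 with the built-in open function, the file object works with byte strings. If you pass a unicode object to write(), Python implicitly encodes it with the default ASCII codec, which fails on characters such as the Euro sign. To write in another encoding, use codecs: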
import codecs
fout = codecs.open('data.txt','w','utf-8')
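As a rough sketch of how this might look end to end, the snippet below writes a row containing a Euro sign through codecs.open and reads it back. The write_row helper, the temporary path, and the sample fields are made up for illustration and are not from the question's code.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile


def write_row(path, fields):
    """Join text fields with commas and write them to `path` as UTF-8.

    Hypothetical helper for illustration only.
    """
    # codecs.open returns a file wrapper that encodes text on write,
    # so no implicit ASCII conversion takes place.
    with codecs.open(path, 'w', 'utf-8') as fout:
        fout.write(u','.join(fields))


# Sample data: \u20ac is the Euro sign.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
write_row(path, [u'France, \u20ac', u'EUR'])

# Read the file back as text to confirm the round trip.
with codecs.open(path, 'r', 'utf-8') as fin:
    content = fin.read()

# Read the raw bytes to see the UTF-8 encoding on disk.
with open(path, 'rb') as fin:
    raw = fin.read()
```

Using the with statement here also closes the file for you, which the bare codecs.open call above does not.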
It looks like you're getting Unicode data from your XML parser, but you're not encoding it before writing it out. You can explicitly encode the result before writing it out to the file:
text = ",".join(stuff) # this will be unicode if any value in stuff is unicode
encoded = text.encode("utf-8") # or use whatever encoding you prefer
fout.write(encoded)
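For what it's worth, here is a small self-contained sketch of the same idea using minidom. The XML string, the extra 'EUR' field, and the output path are invented for the example. The file is opened in binary mode ('wb') so that writing the already-encoded bytes behaves the same way on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import os
import tempfile
from xml.dom.minidom import parseString

# Made-up input resembling the failing entry; \u20ac is the Euro sign.
xml_bytes = u'<row><name>France, \u20ac </name></row>'.encode('utf-8')
data = parseString(xml_bytes)

# minidom hands back text (unicode in Python 2), not bytes.
name = data.getElementsByTagName('name')[0].childNodes[0].nodeValue

# Joining with a unicode separator keeps the result as text.
text = u','.join((name, u'EUR'))

# Encoding to ASCII reproduces the error from the question.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly to UTF-8 gives bytes that are safe to write.
encoded = text.encode('utf-8')

out_path = os.path.join(tempfile.mkdtemp(), 'data.txt')
with open(out_path, 'wb') as fout:
    fout.write(encoded)

# Read the bytes back to check what landed on disk.
with open(out_path, 'rb') as fin:
    written = fin.read()
```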