 

Python: I use .decode() - 'ascii' codec can't encode

Tags:

python

unicode

It seems that I used the wrong function: with .fromstring there are no error messages.

xml_ = load() # here comes the unicode string with Cyrillic letters
print xml_    # prints everything fine
print type(xml_) # 'lxml.etree._ElementUnicodeResult' = unicode
xml = xml_.decode('utf-8') # here is an error
doc = lxml.etree.parse(xml) # if I do not decode it - the same error appears here

  File "testLog.py", line 48, in <module>
    xml = xml_.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 89-96: ordinal not in range(128)

If I encode instead:

xml = xml_.encode('utf-8')

doc = lxml.etree.parse(xml) # here's an error

or leave the string unchanged:

xml = xml_

then I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 89: ordinal not in range(128)

If I understand correctly, I must decode a non-ASCII string into the internal representation, work with that representation, and encode it back before sending it to output. It seems that I do exactly this.

Input data must be in utf-8 because of the 'Accept-Charset': 'utf-8' header.

Ben Usman asked Dec 21 '22 19:12



2 Answers

In Python 2, str and unicode are different types with different in-memory representations of their content. A unicode object holds decoded text, while a str holds that text as bytes in some encoding.

# -*- coding: utf-8 -*-

# With this declaration, string literals in this source file
#    are str objects encoded in utf-8.

# In Python 3, they would be unicode (str) objects instead.
#    The examples below show the Python 2 way.

s = 'ş'
print type(s) # prints <type 'str'>

# Here, we create a unicode object from a string
#    which was encoded in utf-8.
u = s.decode('utf-8')

print type(u) # prints <type 'unicode'>

As you can see,

unicode.encode() --> str
str.decode() --> unicode
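
The two directions can be sketched as a round trip. This is only a minimal sketch: it uses a u'' literal with an escape so that it behaves the same on Python 2.7 and Python 3.3+, and the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# Round trip between text and bytes (names are illustrative only).
text = u'\u015f'                    # LATIN SMALL LETTER S WITH CEDILLA, as text

encoded = text.encode('utf-8')      # text -> bytes (a str in Python 2)
decoded = encoded.decode('utf-8')   # bytes -> text (a unicode in Python 2)

print(repr(encoded))                # the two UTF-8 bytes of the character
print(decoded == text)              # the round trip gives back the same text
```

Decoding only gives back the original text when the codec matches the one used for encoding.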

When we encode to or decode from strings, we need to be sure that the text can be represented in the source or target encoding. An iso-8859-1 encoded string will not, in general, be decoded correctly with iso-8859-9.
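
A small sketch of that mismatch, assuming a single sample byte chosen for illustration: 0xFE is one of the few positions where iso-8859-1 and iso-8859-9 differ, so decoding with the wrong codec raises no error and silently yields a different letter.

```python
# -*- coding: utf-8 -*-
# The byte 0xFE stands for different letters in the two encodings.
raw = u'\u00fe'.encode('iso-8859-1')   # LATIN SMALL LETTER THORN, encoded as iso-8859-1

as_latin1 = raw.decode('iso-8859-1')   # matching codec: thorn again
as_latin5 = raw.decode('iso-8859-9')   # wrong codec: no error, but a different letter

print(repr(raw))
print(as_latin1 == u'\u00fe')          # same character as before encoding
print(as_latin5 == u'\u015f')          # s with cedilla instead of thorn
```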

As for the second error report in the question, lxml.etree.parse() expects a file name or a file-like object as its source, not the XML text itself. To parse from a string, lxml.etree.fromstring() should be used.
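
The difference between the two calls might look like the sketch below. The sample document and the name xml_bytes are hypothetical, and the fallback to the standard library's xml.etree.ElementTree is an assumption made only so the sketch runs where lxml is not installed; both modules provide fromstring() and parse().

```python
# -*- coding: utf-8 -*-
import io

try:
    from lxml import etree                  # the library used in the question
except ImportError:                         # assumption: stdlib fallback for this sketch
    import xml.etree.ElementTree as etree

# Hypothetical document with Cyrillic text, held as UTF-8 encoded bytes.
xml_bytes = u'<root><msg>\u041f\u0440\u0438\u0432\u0435\u0442</msg></root>'.encode('utf-8')

# fromstring() takes the XML content itself and returns the root element.
root = etree.fromstring(xml_bytes)

# parse() takes a file name or file-like object and returns an element tree.
tree = etree.parse(io.BytesIO(xml_bytes))

print(root.tag)
print(tree.getroot().find('msg').text == root.find('msg').text)
```

Wrapping the bytes in io.BytesIO is one way to keep using parse() when the XML is already in memory.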

hasanyasin answered Dec 23 '22 08:12



If your original string is unicode, it only makes sense to encode it to utf-8, not to decode it from utf-8.
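
In Python 2, calling .decode() on a unicode object first encodes it implicitly with the ascii codec, which is where the UnicodeEncodeError in the question comes from. The sketch below makes that implicit step explicit so that it also runs on Python 3, where text has no .decode() at all; the sample text is hypothetical.

```python
# -*- coding: utf-8 -*-
text = u'\u041f\u0440\u0438\u0432\u0435\u0442'   # hypothetical Cyrillic sample text

# What Python 2 does behind the scenes for text.decode('utf-8'):
# an implicit text.encode('ascii'), which cannot represent Cyrillic.
try:
    text.encode('ascii')
    failed = False
except UnicodeEncodeError:
    failed = True

# Encoding is the operation that makes sense for text.
data = text.encode('utf-8')

print(failed)
print(len(data))   # number of UTF-8 bytes, not number of characters
```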

I think the XML parser can only handle XML that is ASCII.

So use xml = xml_.encode('ascii', 'xmlcharrefreplace') to convert the unicode characters that are not in ASCII to XML character references.
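
As a rough illustration of what that error handler produces, assuming a hypothetical one-element document: every character outside ASCII is replaced by a numeric character reference, and the ASCII markup is left as it is.

```python
# -*- coding: utf-8 -*-
text = u'<msg>\u041f\u0440\u0438\u0432\u0435\u0442</msg>'   # hypothetical sample

# Characters outside ASCII become numeric references of the form &#NNNN;
ascii_xml = text.encode('ascii', 'xmlcharrefreplace')

print(ascii_xml)
```

The result is pure ASCII bytes, so it passes through any step that still assumes the ascii codec.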

Marco de Wit answered Dec 23 '22 09:12
