It seems that I used the wrong function. With .fromstring
there are no error messages.
xml_ = load() # here comes the unicode string with Cyrillic letters
print xml_ # prints everything fine
print type(xml_) # 'lxml.etree._ElementUnicodeResult' = unicode
xml = xml_.decode('utf-8') # here is an error
doc = lxml.etree.parse(xml) # if I do not decode it - the same error appears here
File "testLog.py", line 48, in <module>
xml = xml_.decode('utf-8')
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 89-96: ordinal not in range(128)
If
xml = xml_.encode('utf-8')
doc = lxml.etree.parse(xml) # here's an error
or
xml = xml_
then
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 89: ordinal not in range(128)
If I understand it right: I must decode a non-ASCII string into its internal representation, work with that representation, and encode it back before sending it to output? It seems that I do exactly this.
Input data must be in utf-8 due to the 'Accept-Charset': 'utf-8' header.
String and Unicode objects have different types and different representations of their content in memory. Unicode is the decoded form of text while string is an encoded one.
# -*- coding: utf-8 -*-
# Now, my string literals in this source file will
# be str objects encoded in utf-8.
# In Python3, they will be unicode objects.
# Below examples show the Python2 way.
s = 'ş'
print type(s) # prints <type 'str'>
u = s.decode('utf-8')
# Here, we create a unicode object from a string
# which was encoded in utf-8.
print type(u) # prints <type 'unicode'>
As you see,
.encode() --> str
.decode() --> unicode
When we encode to or decode from strings, we need to be sure that the text can actually be represented in the source/target encoding, and that we decode with the same encoding the bytes were written in. An iso-8859-1 encoded string cannot be decoded correctly with iso-8859-9.
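A minimal sketch of such a mismatch, reusing the letter ş from the example above. It uses explicit u'' literals so it should run on both Python 2.7 and Python 3; the variable names are only for illustration.

```python
# -*- coding: utf-8 -*-
# The same byte stands for different characters in different encodings.
text = u'ş'                          # unicode text
raw = text.encode('iso-8859-9')      # a single byte in Latin-5 (Turkish)

right = raw.decode('iso-8859-9')     # same encoding: the round trip works
wrong = raw.decode('iso-8859-1')     # wrong encoding: no exception, wrong letter

assert right == text
assert wrong != text                 # the byte is silently read as another character
```

Note that the wrong decode does not raise anything. Single-byte encodings accept every byte value, so the mistake only shows up as garbled text later.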
As for the second error report in the question, lxml.etree.parse() works on file names and file-like objects, so a string passed to it is not treated as XML markup. To parse from strings, lxml.etree.fromstring() should be used.
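A small sketch of the two calls side by side, assuming a made-up document with Cyrillic text. The import falls back to the standard library's ElementTree, which offers the same two functions, so the snippet also runs where lxml is not installed.

```python
# -*- coding: utf-8 -*-
import io

try:
    from lxml import etree                  # the library from the question
except ImportError:
    import xml.etree.ElementTree as etree   # stdlib fallback with the same two calls

# A utf-8 encoded byte string holding a (hypothetical) document.
xml_bytes = u'<root><msg>Привет</msg></root>'.encode('utf-8')

# fromstring() parses markup held in memory and returns the root element.
root = etree.fromstring(xml_bytes)

# parse() expects a file name or file-like object, so wrap the bytes first.
tree = etree.parse(io.BytesIO(xml_bytes))

msg_from_string = root.find('msg').text
msg_from_file = tree.getroot().find('msg').text
```

Both routes give the same unicode text. Wrapping the bytes in io.BytesIO is only needed if you want to keep using parse().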
If your original string is unicode, it only makes sense to encode it to utf-8, not decode it from utf-8.
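This also explains the odd UnicodeEncodeError raised by .decode() in the question: in Python 2, calling .decode('utf-8') on a unicode object first encodes it implicitly with the ascii codec, and that step fails on Cyrillic letters. A rough sketch of the intended direction of each call, with made-up variable names and bytes standing in for the incoming data:

```python
# -*- coding: utf-8 -*-
# utf-8 bytes as they would arrive from outside (here: the word "Привет")
incoming = b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'

text = incoming.decode('utf-8')      # bytes -> unicode, once, at the input boundary
shout = text.upper()                 # do all the work on unicode
outgoing = shout.encode('utf-8')     # unicode -> bytes, once, at the output boundary
```

If the data is already unicode, as the `_ElementUnicodeResult` in the question is, the decode step is skipped and only the final encode applies.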
I think the xml parser is only coping with ascii input here (XML parsers normally accept utf-8 bytes too, so treat this as a workaround).
So use xml = xml_.encode('ascii','xmlcharrefreplace')
to convert the unicode characters that are not in ascii to xml numeric character references.
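For illustration, this is roughly what that call produces on a made-up fragment. The sample string is an assumption, not the data from the question.

```python
# -*- coding: utf-8 -*-
text = u'<msg>Привет</msg>'

# Characters outside ascii are replaced by &#NNNN; references;
# the markup itself is ascii already and passes through unchanged.
ascii_xml = text.encode('ascii', 'xmlcharrefreplace')
```

The result is a plain byte string containing only ascii, which a parser resolves back to the original letters when it reads the references.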