I want to convert Unicode strings to ISO-8859-15. These strings include the u"\u2019"
(RIGHT SINGLE QUOTATION MARK, see http://www.fileformat.info/info/unicode/char/2019/index.htm) character, which is not part of the ISO-8859-15 character set.
In Python, how can I normalize the Unicode characters so that they fit the ISO-8859-15 encoding?
I have looked at the unicodedata module without success. I managed to do the job with
s.replace(u"\u2019", "'").encode('iso-8859-15')
but I would like to find a more general and cleaner way.
Thanks for your help
Use the unicode version of the translate function, assuming s is a unicode string:
s.translate({ord(u"\u2019"):ord(u"'")})
The argument of the unicode version of translate is a dict mapping unicode ordinals to unicode ordinals. Add to this dict any other characters that you cannot encode in your target encoding.
You can build your mapping table in a little more readable form and create your mapping dict from it, for instance:
char_mappings = [(u"\u2019", u"'"),
                 (u"`", u"'")]
translate_mapping = {ord(k):ord(v) for k,v in char_mappings}
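Putting the pieces together, here is a minimal sketch of the full round trip: build the dict, translate, then encode. The sample string and the variable names `s` and `encoded` are only illustrative assumptions, not part of the question.

```python
# Pairs of (character to replace, replacement character).
char_mappings = [(u"\u2019", u"'"),
                 (u"`", u"'")]

# translate() expects a dict keyed by code point (ordinal), not by character.
translate_mapping = {ord(k): ord(v) for k, v in char_mappings}

# Hypothetical input containing U+2019 and backticks.
s = u"It\u2019s a `test`"

# Replace the mapped characters first, then encode to the target charset.
encoded = s.translate(translate_mapping).encode("iso-8859-15")
```

Any character that is neither in the dict nor in ISO-8859-15 will still raise a UnicodeEncodeError at the encode step, so the dict has to cover everything you expect to meet.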
From the translate documentation:
For Unicode objects, the translate() method does not accept the optional deletechars argument. Instead, it returns a copy of the s where all characters have been mapped through the given translation table which must be a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None. Unmapped characters are left untouched. Characters mapped to None are deleted. Note, a more flexible approach is to create a custom character mapping codec using the codecs module (see encodings.cp1251 for an example).
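As the quoted documentation says, the values in the table do not have to be ordinals. A small sketch, assuming you also want to expand one character into several and drop another entirely (the choice of U+2026 and U+200B is just an example):

```python
# Values may be an ordinal, a unicode string, or None.
table = {
    0x2019: ord(u"'"),  # RIGHT SINGLE QUOTATION MARK -> apostrophe (ordinal)
    0x2026: u"...",     # HORIZONTAL ELLIPSIS -> three ASCII dots (string)
    0x200B: None,       # ZERO WIDTH SPACE -> deleted
}

# Hypothetical input for illustration.
s = u"Wait\u2026 it\u2019s\u200b done"

cleaned = s.translate(table)
encoded = cleaned.encode("iso-8859-15")
```

Mapping to a string is handy when the closest replacement in the target encoding is longer than one character.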
Unless you wish to create a translation rule (if you do, look at Boud's answer), you could choose one of the default error handlers that encode provides, or even register your own:
In [4]: u'\u2019 Hi'.encode('iso-8859-15', 'replace')
Out[4]: '? Hi'
In [5]: u'\u2019 Hi'.encode('iso-8859-15', 'ignore')
Out[5]: ' Hi'
In [6]: u'\u2019 Hi'.encode('iso-8859-15', 'xmlcharrefreplace')
Out[6]: '&#8217; Hi'
From the encode docstring:
S.encode([encoding[,errors]]) -> string or unicode
Encodes S using the codec registered for encoding. encoding defaults to the default encoding. errors may be given to set a different error handling scheme. Default is 'strict' meaning that encoding errors raise a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and 'xmlcharrefreplace' as well as any other name registered with codecs.register_error that can handle UnicodeEncodeErrors.
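For the "register your own" option, a rough sketch could look like the following. The handler name "fallback_replace", the function name and the FALLBACKS table are my own assumptions for illustration; only codecs.register_error and the attributes of UnicodeEncodeError come from the standard library.

```python
import codecs

# Hypothetical table of replacements for characters the codec cannot encode.
FALLBACKS = {
    u"\u2018": u"'",   # LEFT SINGLE QUOTATION MARK
    u"\u2019": u"'",   # RIGHT SINGLE QUOTATION MARK
    u"\u201c": u'"',   # LEFT DOUBLE QUOTATION MARK
    u"\u201d": u'"',   # RIGHT DOUBLE QUOTATION MARK
}

def fallback_replace(error):
    # Only encoding errors are handled here; anything else is re-raised.
    if not isinstance(error, UnicodeEncodeError):
        raise error
    # The slice may cover a run of several unencodable characters.
    bad = error.object[error.start:error.end]
    replacement = u"".join(FALLBACKS.get(ch, u"?") for ch in bad)
    # Return the replacement text and the position where encoding resumes.
    return replacement, error.end

codecs.register_error("fallback_replace", fallback_replace)

# U+2019 is in the table; U+2603 (SNOWMAN) is not, so it becomes "?".
result = u"\u2019 Hi \u2603".encode("iso-8859-15", "fallback_replace")
```

Compared with translate, this only touches characters that actually fail to encode, so characters the target encoding supports are left alone without you having to think about them.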