
How to normalize unicode encoding for iso-8859-15 conversion in python?

I want to convert Unicode strings to iso-8859-15. These strings include the u"\u2019" character (RIGHT SINGLE QUOTATION MARK, see http://www.fileformat.info/info/unicode/char/2019/index.htm), which is not part of the iso-8859-15 character set.

In Python, how can I normalize the Unicode characters so that they can be encoded as iso-8859-15?

I have looked at the unicodedata module without success. I managed to do the job with

s.replace(u"\u2019", "'").encode('iso-8859-15')

but I would like to find a cleaner, more general way.

Thanks for your help.

luc asked Dec 07 '22 14:12


2 Answers

Use the unicode version of the translate function, assuming s is a unicode string:

s.translate({ord(u"\u2019"):ord(u"'")})

The argument of the unicode version of translate is a dict mapping unicode ordinals to unicode ordinals. Add to this dict other characters you cannot encode in your target encoding.

You can build your mapping table in a little more readable form and create your mapping dict from it, for instance:

char_mappings = [(u"\u2019", u"'"),
                 (u"`", u"'")]
translate_mapping = {ord(k):ord(v) for k,v in char_mappings}
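To show how the mapping dict feeds into the actual conversion, here is a minimal sketch. The helper name to_latin9 and the extra table entry are my own illustrations, not part of the original answer, and the code assumes Python 3 (where encode returns bytes); the same calls work on unicode objects in Python 2.7.

```python
# Hypothetical mapping table; extend it with whatever your data contains.
char_mappings = [
    (u"\u2019", u"'"),  # RIGHT SINGLE QUOTATION MARK
    (u"\u2018", u"'"),  # LEFT SINGLE QUOTATION MARK (illustrative extra entry)
    (u"`", u"'"),
]
# translate() wants ordinals as keys, so convert the readable table.
translate_mapping = {ord(k): ord(v) for k, v in char_mappings}


def to_latin9(text):
    """Apply the mapping, then encode as iso-8859-15.

    Anything still unencodable raises UnicodeEncodeError, because the
    default 'strict' error handler is used.
    """
    return text.translate(translate_mapping).encode("iso-8859-15")


# The euro sign is part of iso-8859-15, so it needs no mapping.
encoded = to_latin9(u"It\u2019s 5 \u20ac")
```

Keeping the default strict handler here is deliberate: a character missing from the table fails loudly instead of being silently dropped.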

From translate documentation:

For Unicode objects, the translate() method does not accept the optional deletechars argument. Instead, it returns a copy of the s where all characters have been mapped through the given translation table which must be a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None. Unmapped characters are left untouched. Characters mapped to None are deleted. Note, a more flexible approach is to create a custom character mapping codec using the codecs module (see encodings.cp1251 for an example).
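As the quoted documentation says, the table values do not have to be single ordinals. A small sketch of the other two options, a multi-character replacement string and None for deletion, might look like this (the characters chosen are only examples, and the code assumes Python 3 or a unicode object in Python 2.7):

```python
# Keys are always Unicode ordinals; values may be an ordinal,
# a Unicode string of any length, or None.
translate_mapping = {
    0x2019: u"'",    # RIGHT SINGLE QUOTATION MARK -> apostrophe
    0x2026: u"...",  # HORIZONTAL ELLIPSIS -> three-character string
    0x200B: None,    # ZERO WIDTH SPACE -> deleted
}

cleaned = u"Wait\u2026 it\u2019s\u200b here".translate(translate_mapping)
# cleaned now contains only characters that iso-8859-15 can encode.
encoded = cleaned.encode("iso-8859-15")
```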

Zeugma answered Dec 11 '22 09:12


If you don't want to build a translation table (if you do, see Boud's answer), you can pick one of the built-in error handlers that encode provides, or even register your own:

In [4]: u'\u2019 Hi'.encode('iso-8859-15', 'replace')
Out[4]: '? Hi'

In [5]: u'\u2019 Hi'.encode('iso-8859-15', 'ignore')
Out[5]: ' Hi'

In [6]: u'\u2019 Hi'.encode('iso-8859-15', 'xmlcharrefreplace')
Out[6]: '&#8217; Hi'

From encode docstring:

S.encode([encoding[,errors]]) -> string or unicode

Encodes S using the codec registered for encoding. encoding defaults to the default encoding. errors may be given to set a different error handling scheme. Default is 'strict' meaning that encoding errors raise a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and 'xmlcharrefreplace' as well as any other name registered with codecs.register_error that can handle UnicodeEncodeErrors.
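For the "register your own" route, a rough sketch with codecs.register_error could look like the following. The handler name "latin9_fallback" and the FALLBACKS table are made up for this example, and the code assumes Python 3 (in Python 2.7 the same handler works, but encode returns a str instead of bytes).

```python
import codecs

# Hypothetical fallback table: what to substitute for characters
# that the target encoding cannot represent.
FALLBACKS = {u"\u2019": u"'", u"\u2018": u"'", u"\u2026": u"..."}


def fallback_handler(error):
    """Error handler: substitute known characters, '?' for the rest."""
    if not isinstance(error, UnicodeEncodeError):
        raise error  # only encoding errors are handled here
    bad = error.object[error.start:error.end]
    replacement = u"".join(FALLBACKS.get(ch, u"?") for ch in bad)
    # Return the replacement text and the position where encoding resumes.
    return replacement, error.end


# "latin9_fallback" is an arbitrary name chosen for this sketch.
codecs.register_error("latin9_fallback", fallback_handler)

encoded = u"\u2019 Hi \u2603".encode("iso-8859-15", "latin9_fallback")
```

Compared with translate, this only touches characters that actually fail to encode, so the same handler can be reused with other target encodings.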

Lev Levitsky answered Dec 11 '22 10:12