
How do I get str.translate to work with Unicode strings?

I have the following code:

    import string

    def translate_non_alphanumerics(to_translate, translate_to='_'):
        not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~'
        translate_table = string.maketrans(not_letters_or_digits,
                                           translate_to
                                             *len(not_letters_or_digits))
        return to_translate.translate(translate_table)

Which works great for non-unicode strings:

    >>> translate_non_alphanumerics('<foo>!')
    '_foo__'

But fails for unicode strings:

    >>> translate_non_alphanumerics(u'<foo>!')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 5, in translate_non_alphanumerics
    TypeError: character mapping must return integer, None or unicode

I can't make any sense of the paragraph on "Unicode objects" in the Python 2.6.2 docs for the str.translate() method.

How do I make this work for Unicode strings?

Daryl Spitzer asked Aug 24 '09 18:08



1 Answer

The Unicode version of translate requires a mapping from Unicode ordinals (which you can retrieve for a single character with ord) to Unicode ordinals. If you want to delete characters, you map to None.
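As a minimal sketch of that rule (the table names and sample characters here are illustrative, not from the question), one table can delete characters by mapping their ordinals to None, and another can swap characters by mapping ordinals to ordinals:

```python
# Illustrative tables; the names and characters are assumptions for this sketch.
punctuation = u'<>!'

# Ordinal -> None deletes the character.
delete_table = dict((ord(ch), None) for ch in punctuation)

# Ordinal -> ordinal replaces one character with another.
swap_table = {ord(u'<'): ord(u'['), ord(u'>'): ord(u']')}

deleted = u'<foo>!'.translate(delete_table)
swapped = u'<foo>!'.translate(swap_table)

print(deleted)  # foo
print(swapped)  # [foo]!
```

Characters whose ordinals are missing from the table pass through unchanged, which is why the '!' survives in the second result.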

I changed your function to build a dict that maps the ordinal of each character in not_letters_or_digits to the replacement:

    def translate_non_alphanumerics(to_translate, translate_to=u'_'):
        not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~'
        translate_table = dict((ord(char), translate_to) for char in not_letters_or_digits)
        return to_translate.translate(translate_table)

    >>> translate_non_alphanumerics(u'<foo>!')
    u'_foo__'

edit: It turns out that the translation mapping must map from the Unicode ordinal (via ord) to either another Unicode ordinal, a Unicode string, or None (to delete). I have thus changed the default value for translate_to to be a Unicode literal. For example:

    >>> translate_non_alphanumerics(u'<foo>!', u'bad')
    u'badfoobadbad'
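Because each value can be a Unicode string of any length, the same mechanism can give every character its own replacement. A small sketch, assuming a hypothetical escape_table that is not part of the original function:

```python
# Hypothetical table: each ordinal maps to its own multi-character Unicode string.
escape_table = {
    ord(u'<'): u'&lt;',
    ord(u'>'): u'&gt;',
    ord(u'&'): u'&amp;',
}

# translate() looks up each input character once, so the '&' characters
# inside the replacement strings are not escaped a second time.
escaped = u'<a & b>'.translate(escape_table)

print(escaped)  # &lt;a &amp; b&gt;
```

In Python 2 the replacement values should be Unicode literals, as above; a byte string value is what triggers the "character mapping must return integer, None or unicode" error shown in the question.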
Mike Boers answered Sep 24 '22 00:09