I have the following code:
    import string

    def translate_non_alphanumerics(to_translate, translate_to='_'):
        not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~'
        translate_table = string.maketrans(not_letters_or_digits,
                                           translate_to * len(not_letters_or_digits))
        return to_translate.translate(translate_table)
Which works great for non-unicode strings:
    >>> translate_non_alphanumerics('<foo>!')
    '_foo__'
But fails for unicode strings:
    >>> translate_non_alphanumerics(u'<foo>!')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 5, in translate_non_alphanumerics
    TypeError: character mapping must return integer, None or unicode
I can't make any sense of the paragraph on "Unicode objects" in the Python 2.6.2 docs for the str.translate() method.
How do I make this work for Unicode strings?
To include Unicode characters in your Python source code, you can use Unicode escape sequences of the form \u0123 in a string literal. In Python 2.x, you also need to prefix the string literal with 'u'.
Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters. Unicode (https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code.
The Unicode version of translate requires a mapping from Unicode ordinals (which you can retrieve for a single character with ord) to Unicode ordinals. If you want to delete characters, you map to None.
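As a minimal sketch of that mapping (the table contents here are chosen only for illustration), the dict below replaces two characters by ordinal and deletes a third by mapping it to None:

```python
# Keys are Unicode ordinals; values are ordinals (replace) or None (delete).
table = {
    ord(u'<'): ord(u'_'),  # replace '<' with '_'
    ord(u'>'): ord(u'_'),  # replace '>' with '_'
    ord(u'!'): None,       # delete '!'
}

result = u'<foo>!'.translate(table)
print(result)
```

Characters that have no entry in the table are passed through unchanged.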
I changed your function to build a dict mapping the ordinal of every character to the ordinal of what you want to translate to:
    def translate_non_alphanumerics(to_translate, translate_to=u'_'):
        not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~'
        translate_table = dict((ord(char), translate_to) for char in not_letters_or_digits)
        return to_translate.translate(translate_table)

    >>> translate_non_alphanumerics(u'<foo>!')
    u'_foo__'
Edit: It turns out that the translation mapping must map from the Unicode ordinal (via ord) to either another Unicode ordinal, a Unicode string, or None (to delete). I have thus changed the default value for translate_to to be a Unicode literal. For example:
    >>> translate_non_alphanumerics(u'<foo>!', u'bad')
    u'badfoobadbad'
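The question targets Python 2.6, but if you happen to be on Python 3 (an assumption, not something the question states), every str is already Unicode and string.maketrans no longer exists. A rough equivalent using the built-in str.maketrans might look like the sketch below. Note that this two-string form only supports a single-character translate_to; for longer replacements the dict approach above should still apply.

```python
def translate_non_alphanumerics(to_translate, translate_to='_'):
    # Same character set as above, with the backslash escaped explicitly.
    not_letters_or_digits = '!"#%\'()*+,-./:;<=>?@[\\]^_`{|}~'
    # With two equal-length strings, str.maketrans builds an
    # ordinal -> ordinal dict, which is what str.translate expects.
    translate_table = str.maketrans(
        not_letters_or_digits,
        translate_to * len(not_letters_or_digits),
    )
    return to_translate.translate(translate_table)


print(translate_non_alphanumerics('<foo>!'))
```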