
Approximately converting unicode string to ascii string in python

I don't know whether this is trivial or not, but I need to convert a Unicode string to an ASCII string, and I'd rather not have all those escape characters around. Is it possible to do an "approximate" conversion, where each character is replaced by a similar-looking ASCII character?

For example: Gavin O’Connor gets converted to Gavin O\x92Connor, but I'd really like it to be converted to just Gavin O'Connor. Is this possible? Has anyone written a utility to do it, or do I have to replace all the characters manually?

Thank you very much! Marco

Marco Moschettini asked Nov 10 '11 22:11



2 Answers

Use the Unidecode package to transliterate the string.

>>> import unidecode
>>> unidecode.unidecode(u'Gavin O’Connor')
"Gavin O'Connor"
Petr Viktorin answered Sep 22 '22 07:09


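import unicodedata

unicode_string = u"Gavin O’Connor"
# NFKD splits accented letters into a base letter plus a combining mark;
# encoding with 'ignore' then drops every character that is not ASCII.
print(unicodedata.normalize('NFKD', unicode_string).encode('ascii', 'ignore').decode('ascii'))

Output:

Gavin OConnor

Note that ’ (U+2019) has no NFKD decomposition, so 'ignore' drops it instead of turning it into an apostrophe. Accented letters such as é do come out as their base letter (e).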

Here's the document that describes the normalization forms: http://unicode.org/reports/tr15/
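NFKD has no decomposition for typographic punctuation such as ’ (U+2019), so normalizing and then encoding with 'ignore' drops that character on its own. A minimal standard-library sketch that gets closer to the asker's expected result maps a few such characters by hand before normalizing. The PUNCTUATION table and the approximate_ascii name are made up for illustration, and the table is far from complete:

```python
import unicodedata

# Hypothetical, hand-picked table (an assumption, not a standard mapping):
# typographic punctuation that NFKD leaves untouched, mapped to the
# closest ASCII character.
PUNCTUATION = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
})


def approximate_ascii(text):
    """Return a best-effort ASCII approximation of a Unicode string."""
    # 1. Replace the punctuation we know how to approximate.
    text = text.translate(PUNCTUATION)
    # 2. Decompose accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # 3. Drop whatever is still outside ASCII (combining marks included).
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(approximate_ascii("Gavin O\u2019Connor"))  # Gavin O'Connor
print(approximate_ascii("Caf\u00e9"))            # Cafe
```

Anything missing from the table and without an ASCII decomposition is still silently removed, so for broad coverage (other scripts, symbols) a transliteration package such as Unidecode is likely the better fit.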

Acorn answered Sep 23 '22 07:09