
Where is Python's "best ASCII for this Unicode" database? [closed]

I have some text that uses Unicode punctuation, such as left double quotes and right single quotes used as apostrophes, and I need it in ASCII. Does Python have a database of these characters with their obvious ASCII substitutes, so I can do better than turning them all into "?"?

joeforker asked May 03 '09 03:05




1 Answer

Unidecode looks like a complete solution. It converts fancy quotes to ASCII quotes, strips the accents from accented Latin characters, and even attempts transliteration for characters that have no ASCII equivalent. That way your users don't have to see a bunch of "?" characters when you have to pass their text through a legacy 7-bit ASCII system.

>>> from unidecode import unidecode
>>> print(unidecode(u"\u5317\u4EB0"))
Bei Jing

http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/
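If installing Unidecode is not an option, a rough approximation is possible with the standard library alone. The sketch below assumes Python 3; the `PUNCTUATION` table and the `to_ascii` helper are hypothetical names made up for illustration, and the table only covers a handful of hand-picked characters. It maps common typographic punctuation by hand, then uses `unicodedata.normalize("NFKD", ...)` to split accented letters into a base letter plus combining marks, and finally drops whatever is still not ASCII.

```python
import unicodedata

# Hypothetical, hand-picked table (an assumption, not a standard database).
# Extend it with whatever punctuation actually shows up in your text.
PUNCTUATION = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark, also used as apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
}
_TABLE = str.maketrans(PUNCTUATION)


def to_ascii(text):
    """Best-effort ASCII version of text using only the standard library."""
    # 1. Replace known punctuation with the hand-picked substitutes.
    text = text.translate(_TABLE)
    # 2. Decompose accented letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # 3. Drop everything that is still outside ASCII (combining marks,
    #    plus any character that has no decomposition at all).
    return decomposed.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    print(to_ascii("\u201cna\u00efve caf\u00e9\u201d \u2014 it\u2019s"))
```

Unlike Unidecode, this approach silently drops characters that have no decomposition: `to_ascii(u"\u5317\u4EB0")` returns an empty string, where Unidecode attempts a transliteration. It is only a reasonable fallback for mostly Latin text.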

joeforker answered Oct 17 '22 02:10