Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert Unicode to ASCII without errors in Python

People also ask

How do you stop Unicode errors in Python?

Only a limited number of Unicode characters are mapped to strings. Thus, any character that is not-represented / mapped will cause the encoding to fail and raise UnicodeEncodeError. To avoid this error use the encode( utf-8 ) and decode( utf-8 ) functions accordingly in your code.

How do I convert to ASCII in Python?

chr () is a built-in function in Python that is used to convert the ASCII code into its corresponding character. The parameter passed in the function is a numeric, integer type value. The function returns a character for which the parameter is the ASCII code.

How do you convert a Unicode character to a string in Python?

To convert Python Unicode to string, use the unicodedata. normalize() function. The Unicode standard defines various normalization forms of a Unicode string, based on canonical equivalence and compatibility equivalence.

Does Python use Unicode or ASCII?

1. Python 2 uses str type to store bytes and unicode type to store unicode code points. All strings by default are str type — which is bytes~ And Default encoding is ASCII.


>>> u'aあä'.encode('ascii', 'ignore')
'a'

Decode the string you get back, using either the charset in the the appropriate meta tag in the response or in the Content-Type header, then encode.

The method encode(encoding, errors) accepts custom handlers for errors. The default values, besides ignore, are:

>>> u'aあä'.encode('ascii', 'replace')
b'a??'
>>> u'aあä'.encode('ascii', 'xmlcharrefreplace')
b'aあä'
>>> u'aあä'.encode('ascii', 'backslashreplace')
b'a\\u3042\\xe4'

See https://docs.python.org/3/library/stdtypes.html#str.encode


As an extension to Ignacio Vazquez-Abrams' answer

>>> u'aあä'.encode('ascii', 'ignore')
'a'

It is sometimes desirable to remove accents from characters and print the base form. This can be accomplished with

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore')
'aa'

You may also want to translate other characters (such as punctuation) to their nearest equivalents, for instance the RIGHT SINGLE QUOTATION MARK unicode character does not get converted to an ascii APOSTROPHE when encoding.

>>> print u'\u2019'
’
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
>>> u'\u2019'.encode('ascii', 'ignore')
''
# Note we get an empty string back
>>> u'\u2019'.replace(u'\u2019', u'\'').encode('ascii', 'ignore')
"'"

Although there are more efficient ways to accomplish this. See this question for more details Where is Python's "best ASCII for this Unicode" database?


2018 Update:

As of February 2018, using compressions like gzip has become quite popular (around 73% of all websites use it, including large sites like Google, YouTube, Yahoo, Wikipedia, Reddit, Stack Overflow and Stack Exchange Network sites).
If you do a simple decode like in the original answer with a gzipped response, you'll get an error like or similar to this:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte

In order to decode a gzpipped response you need to add the following modules (in Python 3):

import gzip
import io

Note: In Python 2 you'd use StringIO instead of io

Then you can parse the content out like this:

response = urlopen("https://example.com/gzipped-ressource")
buffer = io.BytesIO(response.read()) # Use StringIO.StringIO(response.read()) in Python 2
gzipped_file = gzip.GzipFile(fileobj=buffer)
decoded = gzipped_file.read()
content = decoded.decode("utf-8") # Replace utf-8 with the source encoding of your requested resource

This code reads the response, and places the bytes in a buffer. The gzip module then reads the buffer using the GZipFile function. After that, the gzipped file can be read into bytes again and decoded to normally readable text in the end.

Original Answer from 2010:

Can we get the actual value used for link?

In addition, we usually encounter this problem here when we are trying to .encode() an already encoded byte string. So you might try to decode it first as in

html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")

As an example:

html = '\xa0'
encoded_str = html.encode("utf8")

Fails with

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

While:

html = '\xa0'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")

Succeeds without error. Do note that "windows-1252" is something I used as an example. I got this from chardet and it had 0.5 confidence that it is right! (well, as given with a 1-character-length string, what do you expect) You should change that to the encoding of the byte string returned from .urlopen().read() to what applies to the content you retrieved.

Another problem I see there is that the .encode() string method returns the modified string and does not modify the source in place. So it's kind of useless to have self.response.out.write(html) as html is not the encoded string from html.encode (if that is what you were originally aiming for).

As Ignacio suggested, check the source webpage for the actual encoding of the returned string from read(). It's either in one of the Meta tags or in the ContentType header in the response. Use that then as the parameter for .decode().

Do note however that it should not be assumed that other developers are responsible enough to make sure the header and/or meta character set declarations match the actual content. (Which is a PITA, yeah, I should know, I was one of those before).


Use unidecode - it even converts weird characters to ascii instantly, and even converts Chinese to phonetic ascii.

$ pip install unidecode

then:

>>> from unidecode import unidecode
>>> unidecode(u'北京')
'Bei Jing'
>>> unidecode(u'Škoda')
'Skoda'

I use this helper function throughout all of my projects. If it can't convert the unicode, it ignores it. This ties into a django library, but with a little research you could bypass it.

from django.utils import encoding

def convert_unicode_to_string(x):
    """
    >>> convert_unicode_to_string(u'ni\xf1era')
    'niera'
    """
    return encoding.smart_str(x, encoding='ascii', errors='ignore')

I no longer get any unicode errors after using this.


For broken consoles like cmd.exe and HTML output you can always use:

my_unicode_string.encode('ascii','xmlcharrefreplace')

This will preserve all the non-ascii chars while making them printable in pure ASCII and in HTML.

WARNING: If you use this in production code to avoid errors then most likely there is something wrong in your code. The only valid use case for this is printing to a non-unicode console or easy conversion to HTML entities in an HTML context.

And finally, if you are on windows and use cmd.exe then you can type chcp 65001 to enable utf-8 output (works with Lucida Console font). You might need to add myUnicodeString.encode('utf8').