
Python: why does str() on some text from a UTF-8 file give a UnicodeEncodeError?

I'm processing a UTF-8 file in Python, and have used simplejson to load it into a dictionary. However, I get a UnicodeEncodeError when I try to turn one of the dictionary values into a string:

import json

f = open('my_json.json', 'r')
master_dictionary = json.load(f)
# some JSON wrangling, then it fails on this line...
mysql_string += " ('" + str(v_dict['code'])
Traceback (most recent call last):
  File "my_file.py", line 25, in <module>
    str(v_dict['code']) + "'), "
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf4' in position 35: ordinal not in range(128)

Why is Python even using ASCII? I thought it used UTF-8 by default, and the input is from a UTF-8 file.

$ file my_json.json 
my_json.json: UTF-8 Unicode English text

What is the problem?

asked Mar 31 '10 by AP257

People also ask

How do I fix UnicodeDecodeError in Python?

The Python "UnicodeDecodeError: 'ascii' codec can't decode byte in position" occurs when the ascii codec is used to decode bytes that were encoded with a different codec. To solve the error, specify the correct encoding, e.g. utf-8.
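A minimal Python 3 sketch of this: the same bytes fail to decode with the ascii codec but decode cleanly once the correct encoding is given.

```python
# Bytes containing a non-ASCII character, encoded as UTF-8.
data = "crème".encode("utf-8")

try:
    data.decode("ascii")  # wrong codec: bytes above 0x7f are not valid ASCII
except UnicodeDecodeError as exc:
    print("ascii failed:", exc)

print(data.decode("utf-8"))  # correct codec decodes cleanly
```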

What does UnicodeDecodeError mean?

The UnicodeDecodeError normally happens when decoding a byte string (str in Python 2) with a particular codec. Since a codec maps only a limited set of byte sequences to Unicode characters, an invalid sequence of bytes causes that codec's decode() to fail.

What is UTF-8 codec can't decode byte?

The Python "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte" occurs when we specify an incorrect encoding when decoding a bytes object. To solve the error, specify the correct encoding, e.g. utf-16, or open the file in binary mode (rb or wb).
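A short Python 3 illustration: text encoded as UTF-16 starts with a byte-order mark (0xff 0xfe), and 0xff is never a valid start byte in UTF-8, which produces exactly this error.

```python
# UTF-16 output begins with a BOM; 0xff is an invalid UTF-8 start byte.
data = "hello".encode("utf-16")
print(data[:2])  # the BOM bytes

try:
    data.decode("utf-8")  # wrong codec for these bytes
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc)

print(data.decode("utf-16"))  # the codec that actually produced the bytes
```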

What is the difference between ISO 8859-1 and UTF-8?

UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.
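This can be seen directly in Python 3 by encoding the same character both ways (using ô, the character from the traceback above, as an example):

```python
ch = "ô"  # U+00F4, within the first 256 Unicode code points

print(ch.encode("latin-1"))  # ISO 8859-1: a single byte
print(ch.encode("utf-8"))    # UTF-8: two bytes for the same character

# Plain ASCII characters are encoded identically in both.
assert "A".encode("latin-1") == "A".encode("utf-8")
```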


1 Answer

Python 2.x uses the ASCII codec by default when converting a unicode object to a str, which is what str() does here. Use unicode.encode() with an explicit encoding if you want to turn a unicode into a str:

v_dict['code'].encode('utf-8')
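The same mechanism can be reproduced in Python 3 for illustration (in Python 3 all strings are Unicode, so the implicit ASCII conversion is gone, but encoding with an explicit codec works the same way). This is a sketch, not the asker's original code; the character u'\xf4' is taken from the traceback above.

```python
code = "\xf4"  # ô, the character the traceback complains about

try:
    code.encode("ascii")  # what Python 2's str() implicitly attempted
except UnicodeEncodeError as exc:
    print("ascii codec failed:", exc)

print(code.encode("utf-8"))  # the explicit encode the answer recommends
```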
answered Oct 19 '22 by Ignacio Vazquez-Abrams