Converting to UTF-8 (again)

Question

I have this string: Traor\u0102\u0160

Traor\u0102\u0160 should produce TraorÃ©, and TraorÃ© decoded as UTF-8 should then produce Traoré.

How can I convert it to Traoré?

What kind of characters are in Traor\u0102\u0160? Unicode?

I've already read http://docs.python.org/howto/unicode.html#encodings many times, but I'm still really confused.

I get this data with the following request:

import json
import requests

# making a request to get this json
r = requests.get('http://cdn.content.easports.com/fifa/fltOnlineAssets/2013/fut/items/web/199074.json')
print r.json

Solution

#! /usr/bin/env python
# -*- coding: utf-8 -*-
import json
import requests

headers = {'Content-Type': 'application/json'}

r = requests.get('http://cdn.content.easports.com/fifa/fltOnlineAssets/2013/fut/items/web/199074.json', headers=headers)


print r.content

# prints
{"Item":{"FirstName":"Lacina","LastName":"Traoré","CommonName":null,"Height":"203","DateOfBirth":{"Year":"1990","Month":"8","Day":"20"},"PreferredFoot":"Left","ClubId":"100766","LeagueId":"67","NationId":"108","Rating":"78","Attribute1":"79","Attribute2":"71","Attribute3":"45","Attribute4":"69","Attribute5":"50","Attribute6":"72","Rare":"1","ItemType":"PlayerA"}}

Basically, I needed to send the right headers.

Thank you all

Burhan Khalid · Accepted Answer

You need to tell requests what encoding to expect:

>>> import requests
>>> url = 'http://cdn.content.easports.com/fifa/fltOnlineAssets/2013/fut/items/web/199074.json'
>>> r = requests.get(url)
>>> r.encoding = 'UTF-8'
>>> r.json[u'Item'][u'LastName']
u'Traor\xe9'

Otherwise, you'll get this:

>>> r = requests.get(url)
>>> r.json['Item']['LastName']
u'Traor\u0102\u0160'

Martijn Pieters · Answer

You have run into a bug in requests; when the server does not set an explicit encoding, requests uses chardet to make an educated guess about the encoding.

In this particular case, it gets that wrong; chardet thinks it's ISO-8859-2 instead of UTF-8. The issue has been reported to the maintainers of requests as issue 765.
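The effect of that wrong guess can be reproduced without any network access. The sketch below assumes the server sends the last name as plain UTF-8 bytes (the variable names are only illustrative), and shows that decoding them as ISO-8859-2 yields exactly the garbled string from the question, while reversing the wrong decode repairs it:

```python
# UTF-8 bytes for the last name (the accented e is the two bytes C3 A9).
raw = b'Traor\xc3\xa9'

# Decoding those bytes as ISO-8859-2 reproduces the garbled string.
garbled = raw.decode('iso-8859-2')

# Decoding the same bytes as UTF-8 gives the intended name.
correct = raw.decode('utf-8')

# An already-garbled string can be repaired by undoing the wrong decode
# and then decoding the recovered bytes as UTF-8.
repaired = garbled.encode('iso-8859-2').decode('utf-8')
```

This also answers the "how do I convert it" part of the question, although fixing the decoding at the source (setting r.encoding) is cleaner than repairing strings afterwards.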

The maintainers closed that issue, blaming the problem on the server not setting a character encoding for the response. The work-around is to set r.encoding = 'utf-8' before accessing r.json so that the contents are correctly decoded without guessing.

However, as J.F. Sebastian correctly points out, if the response really is JSON, then the encoding has to be one of the UTF family of encodings. The JSON RFC even includes a section on how to detect what encoding was used.
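The detection the RFC describes (RFC 4627, section 3) relies on the first two characters of a JSON text always being ASCII, so the pattern of zero bytes in the first four octets gives away the encoding. What follows is only a rough sketch of that idea, not the code from the pull request; the function name detect_json_encoding is hypothetical, and byte order marks are deliberately not handled:

```python
import json


def detect_json_encoding(data):
    """Guess the UTF encoding of JSON bytes from the null-byte pattern
    of the first four octets (RFC 4627, section 3)."""
    head = bytearray(data[:4])
    if len(head) < 4:
        # Too short to classify; assume the JSON default.
        return 'utf-8'
    nulls = [b == 0 for b in head]
    if nulls == [True, True, True, False]:
        return 'utf-32-be'   # 00 00 00 xx
    if nulls == [False, True, True, True]:
        return 'utf-32-le'   # xx 00 00 00
    if nulls == [True, False, True, False]:
        return 'utf-16-be'   # 00 xx 00 xx
    if nulls == [False, True, False, True]:
        return 'utf-16-le'   # xx 00 xx 00
    return 'utf-8'           # xx xx xx xx


# Example: a UTF-16 (little endian) payload is detected and decoded.
payload = u'{"LastName": "Traor\xe9"}'.encode('utf-16-le')
encoding = detect_json_encoding(payload)
data = json.loads(payload.decode(encoding))
```

Because the check looks only at null bytes, it never has to guess between single-byte code pages such as ISO-8859-2, which is where the statistical guess went wrong here.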

I've submitted a pull request to the requests project that does just that; if you ask for the JSON decoded response, and no encoding has been set, it'll detect the correct UTF encoding used instead of guessing.

With this patch in place, the URL loads without setting the encoding explicitly:

>>> import requests
>>> r = requests.get('http://cdn.content.easports.com/fifa/fltOnlineAssets/2013/fut/items/web/199074.json')
>>> r.json[u'Item'][u'LastName']
u'Traor\xe9'
>>> print r.json[u'Item'][u'LastName']
Traoré

Tags: python, character-encoding, utf-8
