
django countries encoding is not giving correct name

I am using the django_countries module for the list of countries. The problem is that a couple of countries have special characters in their names, like 'Åland Islands' and 'Saint Barthélemy'.

I am calling this method to get the country name:

country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name

I know that country_label is a lazily translated proxy object from django.utils, but instead of the right name it gives 'Ã…land Islands'. Any suggestions?

Maverick asked Jun 04 '15 07:06



1 Answer

Django stores a unicode string as a sequence of code points and marks it as unicode for further processing. UTF-8 encodes each code point as one to four 8-bit bytes, so the unicode string Django uses has to be encoded from code points into UTF-8 bytes at some point. In the case of Åland Islands, what seems to be happening is that the UTF-8 bytes are being taken and interpreted as if each byte were a code point.

The string django_countries returns is most likely u'\xc5land Islands', where \xc5 is the Unicode code point of Å (U+00C5). In UTF-8 byte notation \xc5 becomes \xc3\x85, where \xc3 and \x85 are each one 8-bit byte. See: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=xc5&mode=hex

You can also use country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name.encode('utf-8') to go from u'\xc5land Islands' to '\xc3\x85land Islands'.
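As a rough illustration of the code point versus UTF-8 byte distinction, here is a minimal Python 3 sketch (the answer itself uses Python 2 notation, so the u'' prefix is dropped and bytes appear as b'...'). It assumes the name is the plain string shown above rather than a real django_countries object.

```python
# Python 3 sketch: one code point becomes two UTF-8 bytes.
# "name" stands in for the text django_countries is assumed to return.
name = "\u00c5land Islands"           # same text as u'\xc5land Islands' in Python 2

first_code_point = hex(ord(name[0]))  # code point of the first character
encoded = name.encode("utf-8")        # text (code points) -> UTF-8 bytes

print(first_code_point)
print(encoded)
print(len(name), len(encoded))        # the byte string is one element longer
```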

If you then take each byte and use it as a code point, you'll see that it gives you these characters: Ã… See: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=xc3&mode=hex And: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=x85&mode=hex

See the code snippet with the HTML notation of these characters.

<div id="test">&#xC3;&#x85;&#xC5;</div>

So I'm guessing you have two different encodings in your application. One way to get from u'\xc5land Islands' to u'\xc3\x85land Islands' would be to encode to UTF-8, which converts u'\xc5' to '\xc3\x85', and then decode those bytes back to unicode as iso-8859, which gives u'\xc3\x85land Islands'. Since that isn't in the code you're providing, I'm guessing it happens somewhere between the moment you set country_label and the moment your output is displayed incorrectly, either automatically because of encoding settings or through an explicit assignment somewhere.
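The encode-then-misdecode path described above can be reproduced with a short Python 3 sketch. The cp1252 step is an assumption on my part: the answer names iso-8859, but the visible '…' in the question's output is what Windows-1252 produces for byte 0x85, whereas ISO-8859-1 maps it to an invisible control character.

```python
# Python 3 sketch of the suspected double-encoding, using only the stdlib.
name = "\u00c5land Islands"

utf8_bytes = name.encode("utf-8")             # b'\xc3\x85land Islands'

# Wrong decode: every byte is treated as a character of a single-byte encoding.
as_latin1 = utf8_bytes.decode("iso-8859-1")   # 0x85 becomes control character U+0085
as_cp1252 = utf8_bytes.decode("cp1252")       # 0x85 becomes the ellipsis U+2026

# Reversing the wrong step recovers the original text.
repaired = as_cp1252.encode("cp1252").decode("utf-8")

print(repr(as_latin1))
print(as_cp1252)
print(repaired)
```

If the garbled value in your app matches as_cp1252, the reverse round trip is a useful confirmation of the diagnosis, although the better fix is to find and remove the wrong decode.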

FIRST EDIT:

To set the encoding for your app, add # -*- coding: utf-8 -*- at the top of your .py file and <meta charset="UTF-8"> in the <head> of your template. To get a unicode string from a django.utils.functional.proxy object, you can call unicode(), like this:

country_label = unicode(fields.Country(form.cleaned_data.get('country')[0:2]).name)

SECOND EDIT:

Another way to figure out where the problem is would be to use force_bytes (https://docs.djangoproject.com/en/1.8/ref/utils/#module-django.utils.encoding), like this:

from django.utils.encoding import force_bytes
country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name
forced_country_label = force_bytes(country_label, encoding='utf-8', strings_only=False, errors='strict') 

But since you have already tried many conversions without success, maybe the problem is more complex. Can you share your versions of django_countries and Python, and your Django app's language settings? You can also look directly in your django_countries package (it should be in your Python directory), find the file data.py and open it to see what it looks like. Maybe the data itself is corrupted.
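When checking values at different points in the app (or strings copied out of data.py), it can help to look at the actual code points rather than at how they render. The sketch below is Python 3 and stdlib only; code_points and looks_double_encoded are hypothetical helper names, and the check is a heuristic that assumes the wrong decode was cp1252.

```python
def code_points(text):
    """Return each character's code point as a 'U+XXXX' string."""
    return ["U+%04X" % ord(ch) for ch in text]


def looks_double_encoded(text):
    """Heuristic: True if re-encoding as cp1252 and decoding as UTF-8
    succeeds and changes the text, which suggests UTF-8 bytes were
    decoded as cp1252 somewhere upstream."""
    try:
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    return repaired != text


good = "\u00c5land Islands"          # the expected name
bad = "\u00c3\u2026land Islands"     # the garbled name from the question

print(code_points(good)[:2], looks_double_encoded(good))
print(code_points(bad)[:2], looks_double_encoded(bad))
```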

Julien Grégoire answered Oct 12 '22 01:10
