Is Django double encoding a Unicode (utf-8?) string?

I'm having trouble storing and outputting an ndash character as UTF-8 in Django.

I'm getting data from an API. In raw form, as retrieved and viewed in a text editor, a given unit of data may look like:

"I love this detergent \u2013 it is so inspiring." 

(\u2013 is &ndash; as an HTML entity).

If I get this straight from the API and display it in Django, there is no problem: it displays in my browser as a long dash. I noticed, though, that I have to call decode('utf-8') to avoid the "'ascii' codec can't encode character" error if I try to do some operations with that text in my view. According to the Django Debug Toolbar, the text goes to the template as "I love this detergent \u2013 it is so inspiring."

When stored to MySQL and read for output through the same view and template, however, it ends up looking like

"I love this detergent â€“ it is so inspiring"

My MySQL table is set to DEFAULT CHARSET=utf8.

Now, when I read the data from the database through the MySQL monitor in a terminal set to UTF-8, it shows up as

"I love this detergent – it is so inspiring" 

(correct - shows an ndash)

When I use MySQLdb in a Python shell, this line is

"I love this detergent \xe2\x80\x93 it is so inspiring" 

(this is the correct UTF-8 for an ndash)

However, if I run python manage.py shell, and then

In [1]: from myproject.myapp.models import ThatTable
In [2]: msg=ThatTable.objects.all().filter(thefield__contains='detergent')
In [3]: msg
Out[3]: [{'thefield': 'I love this detergent \xc3\xa2\xe2\x82\xac\xe2\x80\x9c it is so inspiring'}]

It appears to me that Django has taken \xe2\x80\x93 to mean three separate characters and encoded each of them as UTF-8, giving \xc3\xa2\xe2\x82\xac\xe2\x80\x9c. This displays as â€“ because \xe2 appears to be â, \x80 appears to be €, etc. I've checked, and this is how it's being sent to the template as well.
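The byte sequence above can be reproduced outside Django. This is a minimal sketch in Python 3 syntax (the question itself uses Python 2.6). It assumes the intermediate step reads the bytes as Windows-1252, which is what MySQL's latin1 character set is based on. The variable names are illustrative only.

```python
# Reproduce the double encoding of an en dash (U+2013).
en_dash = "\u2013"

# Correct UTF-8 encoding: three bytes.
utf8_bytes = en_dash.encode("utf-8")

# Misread those three bytes as three separate Windows-1252 characters.
misread = utf8_bytes.decode("cp1252")

# Encode the three misread characters as UTF-8 a second time.
double_encoded = misread.encode("utf-8")

print(utf8_bytes)       # the correct bytes for an en dash
print(ascii(misread))   # the three characters the bytes were mistaken for
print(double_encoded)   # the long sequence seen through the Django ORM
```

If this assumption holds, `double_encoded` matches the sequence shown in the shell session, which points at a character-set mismatch somewhere between the application and the stored data rather than at Django's template layer.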

If you decode the long sequence in Python with decode('utf-8'), though, the result is \xe2\u20ac\u201c, which also renders in the browser as â€“. Trying to decode it again yields a UnicodeDecodeError.
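A second plain decode fails because the decoded text holds characters, not the original bytes. The misread step has to be undone first. This is a hedged sketch in Python 3, again assuming the stray step was Windows-1252. `undo_double_encoding` is a hypothetical helper name, not part of Django or MySQLdb.

```python
def undo_double_encoding(raw: bytes) -> str:
    """Decode UTF-8 bytes, undoing one cp1252 round trip if present."""
    text = raw.decode("utf-8")
    try:
        # Map each character back to the single byte it was misread from,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or not valid UTF-8 afterwards:
        # assume the text was never double-encoded and return it as is.
        return text


broken = b"I love this detergent \xc3\xa2\xe2\x82\xac\xe2\x80\x9c it is so inspiring"
print(ascii(undo_double_encoding(broken)))
```

This only repairs data after the fact. It does not address whatever setting caused the extra encoding step in the first place.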

I've followed the Django suggestions for Unicode, as far as I know (configured MySQL).

Any suggestions on what I may have misconfigured?

Addendum: It seems this same issue has cropped up in other areas or systems as well. While searching for \xc3\xa2\xe2\x82\xac\xe2\x80\x9c, I found at http://pastie.org/908443.txt a script to 'repair bad UTF8 entities', also found in a WordPress RSS import plugin. It simply replaces this sequence with –. I'd like to solve this the right way, though!

Oh, and I'm using Django 1.2 and Python 2.6.5.

I can connect to the same database with PHP/PDO and print out this data without doing anything special, and it looks fine.

JAL asked Jun 04 '10 05:06


1 Answer

This does seem like a case of double encoding. I don't have much experience with Python, but try adjusting the MySQL connection settings as per the advice at http://tahpot.blogspot.com/2005/06/mysql-and-python-and-unicode.html

My guess is that the connection is latin1, so MySQL tries to encode the string again before storing it in the UTF-8 field. The advice there, specifically this bit:

EDIT: With Python, when establishing a database connection, add the following flag: init_command='SET NAMES utf8'.

In addition, set the following in MySQL's my.cnf: default-character-set = utf8

is probably what you want.
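In a Django project, the connection flag would normally go into the database settings rather than a direct MySQLdb call. This is a sketch of what that might look like, assuming the Django 1.2 DATABASES format and that the OPTIONS entries are handed through to the MySQLdb connection. NAME, USER and PASSWORD are placeholders, not values from the question.

```python
# settings.py fragment (Django 1.2+ DATABASES format).
# NAME, USER and PASSWORD below are placeholders for illustration.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "mydatabase",
        "USER": "myuser",
        "PASSWORD": "secret",
        # Extra keyword arguments for the underlying MySQLdb connection.
        "OPTIONS": {
            "charset": "utf8",
            # Ask the server to treat this connection as UTF-8.
            "init_command": "SET NAMES utf8",
        },
    }
}
```

Whether this resolves the problem depends on where the extra encoding happened. If the rows were already written through a latin1 connection, the stored data may need repairing as well.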

phsource answered Sep 28 '22 07:09