
Why are Chinese characters garbled with webpy but fine with MySQLdb?

I created a database in MySQL and use webpy to build my web server.

But webpy and MySQLdb behave differently with Chinese characters when each is used to access the database.

Here is my problem:

My table t_test (utf8 database):

id     name
1      测试

The UTF-8 encoding of "测试" is: \xe6\xb5\x8b\xe8\xaf\x95

When I use MySQLdb to do a "select" like this:

    c=conn.cursor()
    c.execute("SELECT * FROM t_test")
    items = c.fetchall()
    c.close()
    row = items[0]  # first row of the result set
    print "items=%s, name=%s" % (row, row[1])

The result is correct; it prints:

    items=(127L, '\xe6\xb5\x8b\xe8\xaf\x95'), name=测试

But when I do the same thing with webpy:

    db = web.database(dbn='mysql', host="127.0.0.1", 
             user='test', pw='test', db='db_test', charset="utf8")
    eval_items=db.select('t_test')
    comment=eval_items[0].name
    print "comment code=%s"%repr(comment)
    print "comment=%s"%comment.encode("utf8")

The Chinese characters come out garbled; the output is:

    comment code=u'\xe6\xb5\u2039\xe8\xaf\u2022'
    comment=忙碌鈥姑€

I know webpy's database module also depends on MySQLdb, but the two approaches behave so differently. Why?

BTW, for the reason above, I could just use MySQLdb directly to solve my garbled Chinese characters, but then I lose the column names of the table, which is inelegant. How can I solve this with webpy?

eason asked Nov 07 '12 10:11


1 Answer

Indeed, something is very wrong. As you said in your comment, the UTF-8 bytes for "测试" are E6B5 8BE8 AF95, which display correctly on my UTF-8 terminal here:

>>> d
'\xe6\xb5\x8b\xe8\xaf\x95'
>>> print d
测试

But look at the bytes on your "comment" unicode object:

comment code=u'\xe6\xb5\u2039\xe8\xaf\u2022'

This means part of your content is still the raw UTF-8 byte values of the text (the characters shown as \xYY), while part has been turned into unrelated Unicode code points (the characters shown as \uYYYY). That mixture indicates the data has been seriously garbled.
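As a rough illustration of how such a mixed value can arise, the sketch below decodes the UTF-8 bytes of "测试" with the wrong codec. It is written for Python 3 (an assumption; the code in the question is Python 2), and cp1252 is only a candidate codec that happens to reproduce the value shown above.

```python
# Python 3 sketch: reproduce the garbled value from the question.
utf8_bytes = b"\xe6\xb5\x8b\xe8\xaf\x95"  # UTF-8 encoding of "测试"

correct = utf8_bytes.decode("utf-8")    # right codec: two characters
garbled = utf8_bytes.decode("cp1252")   # wrong codec: one character per byte

print(ascii(correct))
print(ascii(garbled))

# Show which byte became which code point. Bytes that cp1252 maps
# outside the Latin-1 range are the ones displayed as \uXXXX escapes.
for byte, char in zip(utf8_bytes, garbled):
    print("0x%02x -> U+%04X" % (byte, ord(char)))
```

Only the bytes 0x8b and 0x95 land on code points above U+00FF, which is why just those two appear as \uYYYY in the repr.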

MySQL has some catches when it comes to properly decoding encoded text (UTF-8 or otherwise). One of them is passing a proper "charset" parameter to the connection, but you already did that.

One thing you can try is to pass the option use_unicode=False to the connection and decode the UTF-8 strings in your own code.

db = web.database(dbn='mysql', host="127.0.0.1", 
         user='test', pw='test', db='db_test', charset="utf8", use_unicode=False)

Check the options to connect for this and other parameters you might try:

http://mysql-python.sourceforge.net/MySQLdb.html
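If use_unicode=False does hand back raw UTF-8 byte strings, the "decode in your own code" step could look roughly like the sketch below. It is written for Python 3 and assumes each row is dict-like with byte strings in the text columns; `decode_row` is a hypothetical helper, not part of webpy or MySQLdb.

```python
def decode_row(row, encoding="utf-8"):
    """Return a copy of a dict-like row with byte strings decoded to text.

    Hypothetical helper: non-bytes values (ints, None, ...) pass through.
    """
    return {
        column: value.decode(encoding) if isinstance(value, bytes) else value
        for column, value in row.items()
    }


# Stand-in for a row fetched with use_unicode=False (assumed shape).
raw_row = {"id": 1, "name": b"\xe6\xb5\x8b\xe8\xaf\x95"}
row = decode_row(raw_row)
print(ascii(row["name"]))
```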

Regardless of whether the hints above get it to work correctly, I have a workaround for you. It looks like the Unicode characters in your string (the \uYYYY ones, not the raw UTF-8 byte values) come from one of these encodings: ("cp1258", "cp1252", "palmos", "cp1254")

Of these, cp1252 is almost the same as "latin1", which is the default charset MySQL uses if it does not get the "charset" argument in the connection. But this is not only a matter of web2py not passing it to the database, because you are getting mangled characters, not just the wrong encoding. It is as if web2py were encoding and decoding your string back and forth while ignoring encoding errors.

With any of these encodings I could retrieve your original "测试" string, as a UTF-8 byte string, by doing, for example:

comment = comment.encode("cp1252", errors="ignore")

So adding this line might work for you for now, but guessing around with Unicode is never good. The proper fix is to narrow down what makes web2py give you those semi-decoded UTF-8 strings in the first place, and stop it there.
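A slightly more defensive form of the same workaround is sketched below, assuming Python 3. `repair_mojibake` is a hypothetical helper name. Unlike the one-liner above, it uses strict error handling and also decodes the recovered bytes back to text, so a value that was not garbled this way is returned unchanged instead of silently losing characters.

```python
def repair_mojibake(text, wrong_codec="cp1252", real_codec="utf-8"):
    """Undo a decode that used wrong_codec instead of real_codec.

    Hypothetical helper: returns text unchanged if the round trip fails.
    """
    try:
        return text.encode(wrong_codec).decode(real_codec)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not look like real_codec bytes read as wrong_codec.
        return text


garbled = "\xe6\xb5\u2039\xe8\xaf\u2022"
print(ascii(repair_mojibake(garbled)))          # repaired text
print(ascii(repair_mojibake("\u6d4b\u8bd5")))  # already correct, unchanged
```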

update

I checked here, and this is what is happening: the correct UTF-8 string '\xe6\xb5\x8b\xe8\xaf\x95' is read from MySQL, and before it is delivered to you (in the use_unicode=True case) these bytes are decoded as if they were "cp1252". This yields the incorrect Unicode string u'\xe6\xb5\u2039\xe8\xaf\u2022'. It is probably a web2py error; for example, it may not pass your "charset=utf8" parameter to the actual connection. When you set "use_unicode=False", instead of giving you the raw bytes, it apparently takes the incorrect Unicode string and encodes it as "utf-8". This yields the '\xc3\xa6\xc2\xb5\xe2\x80\xb9\xc3\xa8\xc2\xaf\xe2\x80\xa2' sequence you mentioned in your comment below (which is even more incorrect).
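The chain described in this update can be replayed step by step. The sketch below assumes Python 3 and assumes cp1252 is the codec involved, as guessed above; it reproduces the double-encoded byte sequence and then walks it back to the original text.

```python
original = "\u6d4b\u8bd5"  # the text "测试"

# Step 1: the database holds the correct UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# Step 2: the bytes are decoded with the wrong codec (use_unicode=True case).
wrong_text = utf8_bytes.decode("cp1252")

# Step 3: the wrong text is encoded as UTF-8 (use_unicode=False case).
double_encoded = wrong_text.encode("utf-8")
print(double_encoded)

# Walking back: undo step 3, then step 2, then decode properly.
recovered = double_encoded.decode("utf-8").encode("cp1252").decode("utf-8")
print(recovered == original)
```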

All in all, the workaround I mentioned above seems to be the only way to retrieve the original, correct string: given the wrong Unicode string, do u'\xe6\xb5\u2039\xe8\xaf\u2022'.encode("cp1252", errors="ignore"). That is, short of changing how the database connection is set up (or maybe updating web2py or the MySQL drivers, if possible).

** update 2 ** I further checked the code in the web2py dal.py file itself. It attempts to set up the connection as UTF-8 by default, but it looks like it will try both the MySQLdb and pymysql drivers. If you have both installed, try uninstalling pymysql and leaving only MySQLdb.

jsbueno answered Oct 24 '22 21:10