
Python & MySQL: Unicode and Encoding

I am parsing JSON data and trying to store some of it in a MySQL database. I am currently getting the following Unicode error. How should I handle this?

  • Should I handle it from the database side, and if so how can I modify my table to do so?
  • Should I handle it from the Python side?

Here is my table structure:

CREATE TABLE yahoo_questions (
    question_id varchar(40) NOT NULL,
    question_subj varbinary(255),
    question_content varbinary(255),
    question_userId varchar(40) NOT NULL,
    question_timestamp varchar(40),
    category_id varbinary(20) NOT NULL,
    category_name varchar(40) NOT NULL,
    choosen_answer varbinary(255),
    choosen_userId varchar(40),
    choosen_usernick varchar(40),
    choosen_ans_timestamp varchar(40),
    UNIQUE (question_id)
);

Error while inserting via Python code:

Traceback (most recent call last):
  File "YahooQueryData.py", line 78, in <module>
    +"VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)", (row[2], row[5], row[6], quserId, questionTime, categoryId, categoryName, qChosenAnswer, choosenUserId, choosenNickName, choosenTimeStamp))
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/MySQLdb/cursors.py", line 159, in execute
    query = query % db.literal(args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/MySQLdb/connections.py", line 264, in literal
    return self.escape(o, self.encoders)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/MySQLdb/connections.py", line 202, in unicode_literal
    return db.literal(u.encode(unicode_literal.charset))
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 204-230: ordinal not in range(256)

Python code segment:

    #pushing user id to the url to get full json stack
    urlobject = urllib.urlopen(base_url.format(row[2]))
    qnadatajson = urlobject.read()
    data = json.loads(qnadatajson)
cur.execute("INSERT INTO yahoo_questions (question_id, question_subj, question_content, question_userId, question_timestamp,"
            +"category_id, category_name, choosen_answer, choosen_userId, choosen_usernick, choosen_ans_timestamp)"
            +"VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)", (row[2], row[5], row[6], quserId, questionTime, categoryId, categoryName, qChosenAnswer, choosenUserId, choosenNickName, choosenTimeStamp))

JSON structure:

questions: [
    {
        Id: "20111201185322AA5HTDc",
        Subject: "what are the new pokemon call?",
        Content: "I used to know them I stop at dialga and palkia version and I heard there's new ones what's it call ",
        Date: "2011-12-01 18:53:22",
        Timestamp: "1322794402",
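For illustration, here is a minimal sketch of why parsed JSON can trigger this error: `json.loads` returns text (Unicode) strings, so any character outside Latin-1 in a field such as Content cannot be encoded with the `latin-1` codec. The sample payload and the `has_non_latin1` helper are hypothetical and are not taken from the actual Yahoo response.

```python
import json

# Hypothetical payload shaped like the structure above; the curly apostrophe
# (U+2019) in Content is an assumption made to reproduce the problem.
sample = u'{"questions": [{"Id": "20111201185322AA5HTDc", "Content": "what\\u2019s it call"}]}'


def has_non_latin1(text):
    """Return True if text contains a character the latin-1 codec cannot encode."""
    return any(ord(ch) > 255 for ch in text)


data = json.loads(sample)
content = data["questions"][0]["Content"]
print(has_non_latin1(content))
```

A check like this can help confirm which rows contain the offending characters before deciding where to fix the encoding.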

Prior to running the query, I also executed the following in MySQL: SET character_set_client = utf8

And this is how the MySQL variables look:

mysql> SHOW variables LIKE '%character_set%';
+--------------------------+--------------------------------------------------------+
| Variable_name            | Value                                                  |
+--------------------------+--------------------------------------------------------+
| character_set_client     | utf8                                                   |
| character_set_connection | utf8                                                   |
| character_set_database   | latin1                                                 |
| character_set_filesystem | binary                                                 |
| character_set_results    | utf8                                                   |
| character_set_server     | latin1                                                 |
| character_set_system     | utf8                                                   |
| character_sets_dir       | /usr/local/mysql-5.5.10-osx10.6-x86_64/share/charsets/ |
+--------------------------+--------------------------------------------------------+
8 rows in set (0.00 sec)
asked Dec 03 '11 05:12 by add-semi-colons



2 Answers

I think that your MySQLdb Python library doesn't know it's supposed to encode to utf8, and is falling back to latin1, the connection's default character set here (the traceback shows the 'latin-1' codec being used).

When you connect() to your database, pass the charset='utf8' parameter. This should also make a manual SET NAMES or SET character_set_client unnecessary.
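A minimal sketch of the difference, independent of any database: the same text fails under the `latin-1` codec, which is what the traceback shows the driver using, but encodes cleanly as UTF-8. The sample string and the `try_encode` helper are hypothetical.

```python
# Hypothetical sample: e-acute (fits in Latin-1), an en dash and katakana (do not).
text = u"caf\u00e9 \u2013 \u30dd\u30b1\u30e2\u30f3"


def try_encode(value, codec):
    """Return the encoded bytes, or None if the codec cannot represent value."""
    try:
        return value.encode(codec)
    except UnicodeEncodeError:
        return None


print(try_encode(text, "latin-1"))              # None: code points above U+00FF do not fit
print(try_encode(text, "utf-8") is not None)    # True
```

One caveat worth knowing: MySQL's `utf8` character set stores at most three bytes per character, so text containing four-byte characters (for example emoji) would need `utf8mb4`, available from MySQL 5.5.3.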

answered Oct 12 '22 23:10 by Francis Avila



First, make sure you are assigning the charset and use_unicode parameters when making your MySQL connection:

import MySQLdb as mysql  # assumed alias; the original snippet does not show the import

conn = mysql.connect(host='127.0.0.1',
                     user='user',
                     passwd='passwd',
                     db='db',
                     charset='utf8',
                     use_unicode=True)

Secondly, use parameterized queries when actually querying the database, passing the values separately instead of formatting them into the SQL string yourself. Below is an example INSERT query with a string containing a Unicode character.

cursor.execute('INSERT INTO mytable VALUES (null, %s)',
               ('Some string that contains unicode: ' + unichr(300),))
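As a runnable stand-in for the round trip, the sketch below uses Python's built-in `sqlite3` module instead of MySQLdb, so no server is needed. It is a substitute, not the MySQLdb API: `sqlite3` uses `?` placeholders where MySQLdb uses `%s`, and the `mytable` schema is an assumption modeled on the example above.

```python
import sqlite3

# In-memory database used purely for illustration (assumed schema).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, body TEXT)")

value = u"Some string that contains unicode: \u012c"  # U+012C is code point 300

# The value is passed as a parameter, never formatted into the SQL text.
cur.execute("INSERT INTO mytable VALUES (null, ?)", (value,))
conn.commit()

# Read the row back and confirm the text survived unchanged.
cur.execute("SELECT body FROM mytable")
stored = cur.fetchone()[0]
print(stored == value)
conn.close()
```

Note that `unichr` exists only in Python 2; in Python 3 the equivalent call is `chr(300)`.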
answered Oct 13 '22 00:10 by danriti
