
python encoding utf-8

Tags: python, encoding, unicode, utf-8

I am writing some scripts in Python. I build a string and save it to a file. The string holds a lot of data, taken from the directory tree and file names of a directory. According to convmv, my whole directory tree is in UTF-8.

I want to keep everything in UTF-8 because I will store it in MySQL afterwards. For now, in MySQL, which is in UTF-8, I have problems with some characters (like é or è; I am French).

I want Python to always treat strings as UTF-8. I read some information on the internet and did the following.

My script begins like this:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-

    def createIndex():
        import codecs
        toUtf8=codecs.getencoder('UTF8')
        # many operations that build indexSTR, the string that matters
        findex=open('config/index/music_vibration_'+date+'.index','a')
        findex.write(codecs.BOM_UTF8)
        findex.write(toUtf8(indexSTR)) # this fails!

When I run it, I get this error:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2171: ordinal not in range(128)

Edit: I see that the accents are written correctly in my file. After creating this file, I read it and write its contents into MySQL, but I get an encoding problem there and I don't understand why. My MySQL database is in utf8, or seems to be: the SQL query SHOW variables LIKE 'char%' returns only utf8 or binary.

My function looks like this:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-

    def saveIndex(index,date):
        import MySQLdb as mdb
        import codecs

        sql = mdb.connect('localhost','admin','*******','music_vibration')
        sql.charset="utf8"
        findex=open('config/index/'+index,'r')
        lines=findex.readlines()
        for line in lines:
            if line.find('#artiste') != -1:
                artiste=line.split('[:::]')
                artiste=artiste[1].replace('\n','')

                c=sql.cursor()
                c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom="'+artiste+'"')
                nbr=c.fetchone()
                if nbr[0]==0:
                    c=sql.cursor()
                    iArt+=1
                    c.execute('INSERT INTO artistes(nom,status,path) VALUES("'+artiste+'",99,"'+artiste+'/")'.encode('utf8')

The artist names, which display correctly in the file, are written incorrectly to the database. What is the problem?


asked Feb 26 '13 15:02

vekah

1 Answer

You don't need to encode data that is already encoded. When you try, Python 2 first implicitly decodes the byte string to unicode, using the default ASCII codec, before it can encode it back to UTF-8. That implicit decode is what fails here:

    >>> data = u'\u00c3'            # Unicode data
    >>> data = data.encode('utf8')  # encoded to UTF-8
    >>> data
    '\xc3\x83'
    >>> data.encode('utf8')         # Try to *re*-encode it
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Just write your data directly to the file; there is no need to encode already-encoded data.
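
One way to make that rule hard to violate is a small guard that encodes only text and passes byte strings through untouched. The helper name `to_utf8_bytes` is hypothetical, not something from the question; this is a minimal sketch written to run on both Python 2 and Python 3:

```python
# Hypothetical helper: encode text to UTF-8, leave byte strings alone.
TEXT_TYPE = type(u"")  # `unicode` on Python 2, `str` on Python 3


def to_utf8_bytes(value):
    """Return UTF-8 bytes for text; return byte strings unchanged."""
    if isinstance(value, TEXT_TYPE):
        return value.encode("utf-8")
    # Already bytes: encoding again would force an implicit ASCII decode
    # on Python 2 (and is simply not possible on Python 3).
    return value


encoded_once = to_utf8_bytes(u"\u00e9")      # text is encoded
encoded_twice = to_utf8_bytes(encoded_once)  # bytes pass through unchanged
```

Calling the helper on a value that is already encoded is then harmless, which is the behavior the question's `toUtf8(indexSTR)` call was missing.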

If you build up unicode values instead, you would indeed have to encode them before they can be written to a file. In that case use codecs.open(), which returns a file object that encodes unicode values to UTF-8 for you.
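
A minimal sketch of that approach, assuming the index text is a unicode value. It uses `io.open()`, a close relative of `codecs.open()` that takes the encoding as a keyword and behaves the same on Python 2 and 3; the function names and the temporary file are illustrative only:

```python
import io
import os
import tempfile


def append_index(path, index_text):
    """Append unicode text to `path`, encoding it as UTF-8 on the way out."""
    with io.open(path, "a", encoding="utf-8") as findex:
        findex.write(index_text)


def demo():
    """Write one accented line to a temporary file and return the raw bytes."""
    handle, path = tempfile.mkstemp(suffix=".index")
    os.close(handle)
    try:
        append_index(path, u"#artiste[:::]B\u00e9nabar\n")
        with io.open(path, "rb") as raw:
            return raw.read()
    finally:
        os.remove(path)
```

The caller never touches `.encode()`; the file object does the conversion, so there is no opportunity to encode twice.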

You also really don't want to write out the UTF-8 BOM, unless you have to support Microsoft tools that cannot read UTF-8 otherwise (such as MS Notepad).
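
To illustrate why the BOM tends to cause trouble: a reader that decodes the file as plain `utf-8` sees the BOM as an extra U+FEFF character at the start of the text, while the `utf-8-sig` codec strips it. A short sketch using only the standard library (the sample bytes are made up):

```python
import codecs

# Bytes as they would appear in a file that starts with a UTF-8 BOM.
raw = codecs.BOM_UTF8 + u"\u00e9t\u00e9".encode("utf-8")

as_plain_utf8 = raw.decode("utf-8")    # BOM survives as a leading u"\ufeff"
as_utf8_sig = raw.decode("utf-8-sig")  # BOM is removed while decoding
```

Any later code that compares or splits the first line would have to account for that stray character unless the BOM is left out in the first place.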

For your MySQL insert problem, you need to do two things:

Add charset='utf8' to your MySQLdb.connect() call.

Use unicode objects, not str objects, when querying or inserting, and use SQL parameters so the MySQL connector can do the right thing for you:

    artiste = artiste.decode('utf8')  # it is already UTF8, decode to unicode

    c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))

    # ...

    c.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
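
The same pattern can be tried without a MySQL server. The sketch below swaps in the standard-library `sqlite3` module purely as a stand-in: it uses `?` placeholders where MySQLdb uses `%s`, and the table layout is an assumption guessed from the queries above. The point it illustrates is that the driver, not string concatenation, handles accents and quotes in the values:

```python
import sqlite3


def insert_artist_if_missing(conn, artiste):
    """Insert `artiste` unless a row with that name exists; return True if inserted."""
    cursor = conn.cursor()
    cursor.execute("SELECT COUNT(id) AS nbr FROM artistes WHERE nom=?", (artiste,))
    if cursor.fetchone()[0]:
        return False
    cursor.execute(
        "INSERT INTO artistes(nom, status, path) VALUES(?, 99, ?)",
        (artiste, artiste + u"/"),
    )
    return True


# In-memory database with an assumed schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE artistes(id INTEGER PRIMARY KEY, nom TEXT, status INTEGER, path TEXT)"
)

first = insert_artist_if_missing(conn, u"L'Imp\u00e9ratrice")   # new row
second = insert_artist_if_missing(conn, u"L'Imp\u00e9ratrice")  # already present
stored = conn.execute("SELECT nom, path FROM artistes").fetchall()
```

Note that the name contains both an apostrophe and an accented letter; with the concatenated query from the question, the apostrophe alone could break the statement or open it to SQL injection.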

It may actually work better if you used codecs.open() to decode the contents automatically instead:

    import codecs
    import MySQLdb as mdb

    sql = mdb.connect('localhost','admin','*******','music_vibration', charset='utf8')
    artists_inserted = 0

    with codecs.open('config/index/'+index, 'r', 'utf8') as findex:
        for line in findex:
            if u'#artiste' not in line:
                continue

            artiste=line.split(u'[:::]')[1].strip()

            # run the lookup and insert once per matching line, inside the loop
            cursor = sql.cursor()
            cursor.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))
            if not cursor.fetchone()[0]:
                cursor = sql.cursor()
                cursor.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
                artists_inserted += 1
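
The parsing step in that loop can be isolated into a small generator, which makes it easy to check without a database. This is a sketch only; `iter_artists` is a hypothetical name, and the sample lines assume the `#artiste[:::]name` layout implied by the question's code:

```python
def iter_artists(lines):
    """Yield artist names from already-decoded (unicode) index lines."""
    for line in lines:
        if u"#artiste" not in line:
            continue
        # The name is the field after the first [:::] separator.
        yield line.split(u"[:::]")[1].strip()


# Made-up sample lines standing in for the contents of an index file.
sample = [
    u"#album[:::]Reprise des n\u00e9gociations\n",
    u"#artiste[:::]B\u00e9nabar\n",
    u"#artiste[:::]Zaz\n",
]
artists = list(iter_artists(sample))
```

Any iterable of unicode lines works as input, including the file object returned by `codecs.open(..., 'r', 'utf8')`.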

You may want to brush up on Unicode and UTF-8 and encodings. I can recommend the following articles:

The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

answered Sep 30 '22 03:09

Martijn Pieters
