
Python: Convert Unicode to ASCII without errors for CSV file

I've been reading all the questions about conversion from Unicode to CSV in Python here on Stack Overflow and I'm still lost. Every time, I receive a "UnicodeEncodeError: 'ascii' codec can't encode character u'\xd1' in position 12: ordinal not in range(128)".

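import csv
import cStringIO

buffer = cStringIO.StringIO()
writer = csv.writer(buffer, csv.excel)
cr.execute(query, query_param)  # cr is an open database cursor
while True:
    row = cr.fetchone()
    if row is None:  # no more rows
        break
    writer.writerow([s.encode('ascii', 'ignore') for s in row])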

The value of row is

(56, u"LIMPIADOR BA\xd1O 1'5 L")

where \xd1 in the database value is ñ, an n with a tilde used in Spanish. At first I tried to convert the value to something valid in ASCII, but after losing so much time I'm now only trying to ignore those characters (I suppose I'd have the same problem with accented vowels).

I'd like to save the value to the CSV, preferably with the ñ ("LIMPIADOR BAÑO 1'5 L"), but if that's not possible, at least be able to save it without it ("LIMPIADOR BAO 1'5 L").

Sergi asked Jan 10 '11 19:01



1 Answer

Correct, ñ is not a valid ASCII character, so you can't encode it to ASCII. You can, as your code does above, ignore such characters. Another option is to remove the accents, as described here: What is the best way to remove accents in a Python unicode string?
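To make the difference between the two options concrete, here is a minimal sketch. It assumes Python 3 (the code in this thread is Python 2), and the helper names drop_non_ascii and strip_accents are made up for illustration:

```python
import unicodedata

def drop_non_ascii(text):
    """Discard every character that has no ASCII encoding."""
    return text.encode("ascii", "ignore").decode("ascii")

def strip_accents(text):
    """Decompose accented letters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

name = "LIMPIADOR BA\xd1O 1'5 L"  # the value from the question
print(drop_non_ascii(name))   # the Ñ disappears entirely
print(strip_accents(name))    # the Ñ becomes a plain N
```

With the value from the question, the first call gives "LIMPIADOR BAO 1'5 L" and the second gives "LIMPIADOR BANO 1'5 L". Neither is the original word.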

But note that both techniques can have bad effects, such as making words mean something different. So the best option is to keep the accents. Then you can't use ASCII, but you can use another encoding. UTF-8 is the safe bet. Latin-1 (ISO-8859-1) is a common one, but it includes only Western European characters. CP-1252 is common on Windows, and so on.

So just switch "ascii" for whatever encoding you want.
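As a rough illustration of what switching the encoding does, this sketch (Python 3 assumed, not the Python 2 of the question) encodes the same text with the three encodings mentioned above and shows that only ASCII refuses it:

```python
name = "BA\xd1O"  # the text BAÑO

for encoding in ("utf-8", "latin-1", "cp1252"):
    encoded = name.encode(encoding)
    # Each of these encodings can represent Ñ, so the round trip is lossless.
    assert encoded.decode(encoding) == name
    print(encoding, repr(encoded))

try:
    name.encode("ascii")
except UnicodeEncodeError as exc:
    # ASCII has no byte for Ñ, which is the error from the question.
    print("ascii fails:", exc.reason)
```

UTF-8 uses two bytes for Ñ while Latin-1 and CP-1252 use one, so whoever reads the CSV has to decode it with the same encoding that was used to write it.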


Your actual code, according to your comment, is:

writer.writerow([s.encode('utf8') if type(s) is unicode else s for s in row]) 

where

row = (56, u"LIMPIADOR BA\xd1O 1'5 L")

Now, I believe that should work, but apparently it doesn't. I think unicode gets passed into the csv writer by mistake anyway. Unwrap that long line into its parts:

col1, col2 = row # Use the names of what is actually there instead
row = col1, col2.encode('utf8')
writer.writerow(row) 

Now your real error will not be hidden by the fact that you put everything on the same line. This could probably also have been avoided if you had included a proper traceback.
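For comparison, here is a sketch of the same task on Python 3, which is an assumption since the code in this thread is Python 2. There the csv module accepts text directly, so one option is to write the rows as text and encode the finished CSV once at the end. The function name rows_to_csv_bytes is hypothetical:

```python
import csv
import io

def rows_to_csv_bytes(rows, encoding="utf-8"):
    """Write rows with csv.writer as text, then encode the whole result."""
    buffer = io.StringIO(newline="")  # let the csv module control line endings
    writer = csv.writer(buffer, csv.excel)
    writer.writerows(rows)  # non-string values such as 56 are converted with str()
    return buffer.getvalue().encode(encoding)

data = rows_to_csv_bytes([(56, "LIMPIADOR BA\xd1O 1'5 L")])
print(data.decode("utf-8"))
```

Encoding once at the end avoids the per-cell encode calls, and with them the question of which cells are text and which are numbers.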

Lennart Regebro answered Nov 14 '22 22:11