Latin-1 and the unicode factory in Python

Tags: python, unicode

I have a Python 2.6 script that is gagging on special characters, encoded in Latin-1, that I am retrieving from a SQL Server database. I would like to print these characters, but I'm somewhat limited because I am using a library that calls the unicode factory, and I don't know how to make Python use a codec other than ascii.

The script is a simple tool to return lookup data from a database without having to execute the SQL directly in a SQL editor. I use the PrettyTable 0.5 library to display the results.

The core of the script is this bit of code. The tuples I get from the cursor contain integer and string data, and no Unicode data. (I'd use adodbapi instead of pyodbc, which would get me Unicode, but adodbapi gives me other problems.)

x = pyodbc.connect(cxnstring)
r = x.cursor()
r.execute(sql)

t = PrettyTable(columns)
for rec in r:
    t.add_row(rec)
r.close()
x.close()

t.set_field_align("ID", 'r')
t.set_field_align("Name", 'l')
print t

But the Name column can contain characters that fall outside the ASCII range. I'll sometimes get an error message like this, in line 222 of prettytable.pyc, when it gets to the t.add_row call:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 12: ordinal not in range(128)

This is line 222 in prettytable.py. It uses unicode, which is the source of my problems, and not just in this script, but in other Python scripts that I have written.

for i in range(0,len(row)):
    if len(unicode(row[i])) > self.widths[i]:   # This is line 222
        self.widths[i] = len(unicode(row[i]))
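
For context, the traceback comes from an implicit decode: in Python 2, calling unicode() on a byte string without an encoding argument uses the default codec, which is ascii unless it has been changed. Below is a minimal sketch of the same failure, written to run on Python 2.6+ and 3. The sample value `raw` is made up for illustration, and the explicit `.decode("ascii")` stands in for what `unicode(raw)` does.

```python
# Hypothetical value as it might come back from the database:
# "Garc\xeda" is "Garcia" with an i-acute, encoded in Latin-1.
raw = b"Garc\xeda"

# Decoding with ascii fails on the first byte above 0x7f.
try:
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)

# Naming the real encoding succeeds and yields a Unicode string.
name = raw.decode("latin-1")

print(error)
print(repr(name))
```

The fix therefore has to happen before the row reaches the library: once the values are already Unicode, unicode() has nothing left to decode.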

Please tell me what I'm doing wrong here. How can I make unicode work without hacking prettytable.py or any of the other libraries that I use? Is there even a way to do this?

EDIT: The error occurs not at the print statement, but at the t.add_row call.

EDIT: With Bastien Léonard's help, I came up with the following solution. It's not a panacea, but it works.

x = pyodbc.connect(cxnstring)
r = x.cursor()
r.execute(sql)

t = PrettyTable(columns)
for rec in r:
    urec = [s.decode('latin-1') if isinstance(s, str) else s for s in rec]
    t.add_row(urec)
r.close()
x.close()

t.set_field_align("ID", 'r')
t.set_field_align("Name", 'l')
print t.get_string().encode('latin-1')

I ended up having to decode on the way in and encode on the way out. All of this makes me hopeful that everybody ports their libraries to Python 3.x sooner rather than later!
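
The decode-in, encode-out pattern can be pulled into a small helper so it is reusable across scripts. This is only a sketch: `decode_row` is a hypothetical name, the sample rows stand in for what the cursor returns, and plain string formatting stands in for PrettyTable. It checks for `bytes` rather than `str` so that it runs on both Python 2.6+ (where `bytes` is an alias for `str`) and Python 3.

```python
def decode_row(row, encoding="latin-1"):
    """Return a list in which every byte string in row is decoded to Unicode.

    Non-string values (integers, None, ...) are passed through unchanged.
    """
    return [v.decode(encoding) if isinstance(v, bytes) else v for v in row]


# Made-up rows standing in for the tuples yielded by the cursor.
rows = [(1, b"Garc\xeda"), (2, b"Smith")]

# Decode on the way in ...
decoded = [decode_row(r) for r in rows]

# ... work purely with Unicode in the middle ...
output = u"\n".join(u"%d %s" % (ident, name) for ident, name in decoded)

# ... and encode once on the way out.
encoded = output.encode("latin-1")
```

If the output can contain characters outside Latin-1, the final encode raises UnicodeEncodeError. Passing `errors="replace"` to `encode` is one way to degrade gracefully instead.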

eksortso asked Jul 20 '09 20:07



2 Answers

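Add this at the beginning of the module:

# coding: latin1

Note that this declaration only tells Python how the source file itself is encoded, so it affects string literals in the script, not byte strings read from the database. For those, decode the string to Unicode yourself.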

[Edit]

It's been a while since I played with Unicode, but hopefully this example will show how to convert from Latin1 to Unicode:

>>> s = u'ééé'.encode('latin1') # a string you may get from the database
>>> s.decode('latin1')
u'\xe9\xe9\xe9'
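
One caveat with this approach, sketched below: Latin-1 maps every byte value 0 to 255 onto the Unicode code point with the same number, so decoding as Latin-1 never raises an error. That is convenient, but it also means that data in a different encoding decodes silently into the wrong characters. The variable names are illustrative only.

```python
# Every possible byte value decodes, and each maps to the same code point.
all_bytes = bytes(bytearray(range(256)))
text = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in text]

# The flip side: UTF-8 bytes decoded as Latin-1 do not fail, they just
# come out as two wrong characters instead of one e-acute.
utf8_bytes = u"\xe9".encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")
```

So it is worth confirming that the database column really holds Latin-1 before relying on the decode.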

[Edit]

Documentation:
http://docs.python.org/howto/unicode.html
http://docs.python.org/library/codecs.html

Bastien Léonard answered Oct 16 '22 00:10


Maybe try to decode the latin1-encoded strings into unicode?

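t.add_row([value.decode('latin1') if isinstance(value, str) else value for value in rec])

(A list is used rather than a generator because add_row calls len() on the row, and the isinstance check skips the integer columns, which have no decode method.)

liori answered Oct 16 '22 00:10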