
Encoding issues when calling MS SQL Server from pyodbc

I am connecting to a MS SQL server through SQLAlchemy, using the pyodbc module. Everything appeared to be working fine until I began having problems with encodings: some of the non-ASCII characters are being replaced with '?'.

The DB has the collation 'Latin1_General_CI_AS' (I've checked the specific fields as well, and they keep the same collation). I started by selecting the encoding 'latin1' in the call to create_engine, and that appears to work for Western European characters (like é in French or Spanish) but not for Eastern European characters. Specifically, I have a problem with the character ć.

I have been trying other encodings listed in the Python documentation, specifically the Microsoft ones like cp1250 and cp1252, but I keep facing the same problem.

Does anyone know how to resolve these differences? Does the collation 'Latin1_General_CI_AS' have an equivalent among Python encodings?

The code for my current connection is the following:

import pyodbc
from sqlalchemy import create_engine

def connect():
    return pyodbc.connect('DSN=database;UID=uid;PWD=password')

engine = create_engine('mssql://', creator=connect, encoding='latin1')
connection = engine.connect()

Clarifications and comments:

  • This problem happens when retrieving information from the DB. I don't need to store anything.
  • At the beginning I didn't specify any encoding, and the result was that whenever a non-ASCII character was encountered in the DB, pyodbc raised a UnicodeDecodeError. I worked around that by using 'latin1' as the encoding, but that doesn't solve the problem for all characters.
  • I admit the server is not on latin1; that comment was incorrect. I have checked both the database collation and the collations of the specific fields, and everything appears to be 'Latin1_General_CI_AS'. Then how can ć be stored? Maybe I'm not understanding collations correctly.
  • I corrected the question a little: specifically, I have tried more encodings than latin1, also cp1250 and cp1252 (which apparently is the one used by 'Latin1_General_CI_AS', according to MSDN).
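As an aside, the character ć itself can be used to tell these encodings apart with nothing but the standard library (no database involved): it exists in cp1250 (Windows Central European) but has no mapping at all in latin1 or cp1252, so no 'latin1'-family choice could ever produce it.

```python
# 'ć' (U+0107) round-trips through cp1250, the Windows
# Central European codepage...
assert 'ć'.encode('cp1250').decode('cp1250') == 'ć'

# ...but has no mapping in latin1 or cp1252, so encoding fails:
for enc in ('latin1', 'cp1252'):
    try:
        'ć'.encode(enc)
    except UnicodeEncodeError:
        print(enc, 'cannot represent ć')
```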

UPDATE:

OK, following these steps, I get that the encoding used by the DB appears to be cp1252: http://bytes.com/topic/sql-server/answers/142972-characters-encoding. Anyway, that appears to be a bad assumption, as reflected in the answers.

UPDATE2: Anyway, after configuring the ODBC driver properly, I don't need to specify the encoding in the Python code.

Khelben asked Sep 20 '10

3 Answers

You should stop using code pages and switch to Unicode. This is the only way to get rid of this kind of problem.

sorin answered Nov 09 '22


Original comment turned into an answer:

cp1250 and cp1252 are NOT "latin1 encodings". A collation is not an encoding. Re your comment: Who says that "the server is encoded in latin1"? If the server expects all input/output to be encoded in latin1 (which I doubt), then you quite simply can't get some Eastern European characters into your database (nor Russian, Chinese, Greek, etc etc).

Update:

You need to look further afield than the collation. The claim "msdn.microsoft.com/en-us/library/ms174596(v=SQL.90).aspx suggests that for Latin1_General_CI_AS the encoding used is cp1252" is codswallop. The table provides an LCID (locale ID), default collation, and codepage for each locale. Yes, the collation "Latin1_General_CI_AS" is listed in association with the cp1252 codepage for several locales. For two locales (Armenian and Georgian), it is listed in association with the "Unicode" codepage (!!!).

Quite simply, you need to find out what codepage the database is using.

Try to extract data from the database without specifying an encoding at all. Don't bother encoding that in whatever encoding you guess your console may be using -- that only adds another source of confusion. Instead, use print repr(data). Report back here what you get from the repr() where you expect non-Latin1 characters.
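To illustrate why the repr() diagnosis matters, here is a sketch using a hypothetical raw byte (0xE6) standing in for data fetched from the driver: the same byte decodes to entirely different characters depending on which codepage you assume, so a wrong guess silently corrupts the text rather than erroring out.

```python
# Hypothetical byte as fetched from the driver (not real DB output):
raw = b'\xe6'

# The same byte, decoded under two different codepage assumptions:
print(repr(raw.decode('cp1250')))  # 'ć'  (Central European)
print(repr(raw.decode('cp1252')))  # 'æ'  (Western European)
```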

John Machin answered Nov 09 '22


Try connecting to the db with the pyodbc.connect() parameter convert_unicode=True, e.g. from SQLAlchemy:

engine = create_engine('mssql://yourdb', connect_args={'convert_unicode': True})

This should make sure that all the results you get (and not only those from nvarchar columns, etc.) are unicode, correctly converted from whatever encoding is used in the db.

As for writing to the db, just always use unicode. If I'm not mistaken (will check later), pyodbc will make sure it will get written to the db correctly as well.

(of course, if the db uses an encoding that does not support the characters you want to write, you will still get errors: if you want the columns to support any kind of character, you will have to use unicode columns on the db too)
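Incidentally, the '?' substitution described in the question is exactly what that failure mode looks like when it is silently suppressed rather than raised: SQL Server substitutes '?' for characters with no mapping in the target codepage, the same behavior as Python's errors='replace'. A minimal stdlib illustration (no pyodbc needed):

```python
# A character with no mapping in the target codepage becomes '?',
# which is the mangling the question describes:
assert 'ć'.encode('cp1252', errors='replace') == b'?'

# A Unicode-capable target (e.g. an nvarchar column, or UTF-8 on
# the Python side) keeps the character intact:
assert 'ć'.encode('utf-8').decode('utf-8') == 'ć'
```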

Steven answered Nov 09 '22