
using pyodbc on linux to insert unicode or utf-8 chars in a nvarchar mssql field


I am using Ubuntu 9.04

I have installed the following package versions:

    unixodbc and unixodbc-dev: 2.2.11-16build3
    tdsodbc: 0.82-4
    libsybdb5: 0.82-4
    freetds-common and freetds-dev: 0.82-4

I have configured /etc/unixodbc.ini like this:

    [FreeTDS]
    Description = TDS driver (Sybase/MS SQL)
    Driver      = /usr/lib/odbc/libtdsodbc.so
    Setup       = /usr/lib/odbc/libtdsS.so
    CPTimeout   =
    CPReuse     =
    UsageCount  = 2

I have configured /etc/freetds/freetds.conf like this:

    [global]
        tds version = 8.0
        client charset = UTF-8

I have grabbed pyodbc revision 31e2fae4adbf1b2af1726e5668a3414cf46b454f from http://github.com/mkleehammer/pyodbc and installed it using "python setup.py install"

I have a Windows machine with Microsoft SQL Server 2000 installed on my local network, up and listening on the local IP address 10.32.42.69. I have created an empty database named "Common". I have the user "sa" with password "secret" and full privileges.

I am using the following Python code to set up the connection:

    import pyodbc

    odbcstring = "SERVER=10.32.42.69;UID=sa;PWD=secret;DATABASE=Common;DRIVER=FreeTDS"
    con = pyodbc.connect(odbcstring)
    cur = con.cursor()
    cur.execute('''
    CREATE TABLE testing (
        id INTEGER NOT NULL IDENTITY(1,1),
        name NVARCHAR(200) NULL,
        PRIMARY KEY (id)
    )
        ''')
    con.commit()

Everything WORKS up to this point. I have checked with SQL Server's Enterprise Manager on the server and the new table is there. Now I want to insert some data into the table.

    cur = con.cursor()
    cur.execute('INSERT INTO testing (name) VALUES (?)', (u'something',))

That fails!! Here's the error I get:

    pyodbc.Error: ('HY004', '[HY004] [FreeTDS][SQL Server]Invalid data type  (0) (SQLBindParameter)')

Since my client is configured to use UTF-8, I thought I could solve this by encoding the data to UTF-8. That works, but then I get strange data back:

    cur = con.cursor()
    cur.execute('DELETE FROM testing')
    cur.execute('INSERT INTO testing (name) VALUES (?)', (u'somé string'.encode('utf-8'),))
    con.commit()
    # fetching data back
    cur = con.cursor()
    cur.execute('SELECT name FROM testing')
    data = cur.fetchone()
    print type(data[0]), data[0]

That gives no error, but the data returned is not the same data sent! I get:

    <type 'unicode'> somÃ© string

That is, pyodbc won't accept a unicode object directly, but it returns unicode objects back to me! And the encoding is being mixed up!

Now for the question:

I want code to insert unicode data into an NVARCHAR and/or NTEXT field. When I query it back, I want to get the same data I inserted.

That can be done by configuring the system differently, or by using a wrapper function that converts the data correctly to/from unicode when inserting or retrieving.

That's not asking much, is it?

asked Jun 03 '09 20:06 by nosklo




1 Answer

I remember having this kind of stupid problem with ODBC drivers, although at that time it was a Java + Oracle combination.

The core issue is that the ODBC driver apparently encodes the query string when sending it to the DB. Even if the field is Unicode and you provide Unicode, in some cases that does not seem to matter.

You need to ensure that what the driver sends has the same encoding as your database (not only the server, but also the database). Otherwise, of course you get funky characters, because either the client or the server is mixing things up when encoding or decoding. Do you have any idea which charset (code page, as MS likes to say) your server uses by default for decoding data?

Collation has nothing to do with this problem :)

See that MS page for example. For Unicode fields, collation is used only to define the sort order in the column, not to specify how the data is stored.

If you store your data as Unicode, there is a unique way to represent it. That is the purpose of Unicode: no need to define a charset that is compatible with all the languages that you are going to use :)

The question here is "what happens when I give data to the server that is not Unicode?". For example:

  • When I send a UTF-8 string to the server, how does it understand it?
  • When I send a UTF-16 string to the server, how does it understand it?
  • When I send a Latin1 string to the server, how does it understand it?

From the server's perspective, all three of these strings are only a stream of bytes. The server cannot guess the encoding in which you encoded them. This means you will get into trouble if your ODBC client ends up sending bytestrings (encoded strings) to the server instead of sending unicode data. If it does, the server will use a predefined encoding (that was my question: which encoding will the server use? Since it is not guessing, it must be a parameter value), and if the string was encoded using a different encoding, dzing, the data will get corrupted.
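As a small illustration of the point above, the following sketch (plain Python, no database involved; the variable names are only for illustration) encodes the same text three ways and then decodes it with the wrong assumption. It is written to run under both Python 2.7 and Python 3.

```python
text = u"Andr\u00e9"  # "André", written with an escape to avoid source-encoding issues

# The same text becomes three different byte streams.
as_utf8 = text.encode("utf-8")        # 6 bytes: the e-acute takes two bytes
as_utf16 = text.encode("utf-16-le")   # 10 bytes: two bytes per character
as_latin1 = text.encode("latin-1")    # 5 bytes: one byte per character

# A receiver that assumes Latin-1 silently misreads the UTF-8 bytes:
# the two bytes of the e-acute turn into two separate characters.
misread = as_utf8.decode("latin-1")

# A receiver that assumes UTF-8 cannot decode the Latin-1 bytes at all.
try:
    as_latin1.decode("utf-8")
    latin1_read_as_utf8 = True
except UnicodeDecodeError:
    latin1_read_as_utf8 = False
```

Nothing in the byte streams themselves says which encoding produced them, so the receiver's assumption decides whether the text survives, gets silently mangled, or is rejected.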

It is just like doing this in Python:

    uni = u'Hey my name is André'
    in_utf8 = uni.encode('utf-8')
    # send the utf-8 data to server
    # send(in_utf8)

    # on server side
    # server receives it. But server is Japanese.
    # So the server treats the data with the National charset, shift-jis:
    some_string = in_utf8 # some_string = receive()
    decoded = some_string.decode('sjis')

Just try it. It's fun. The decoded string is supposed to be "Hey my name is André", but it comes out as "Hey my name is Andrﾃｩ": the é is replaced by the half-width katakana ﾃｩ.

Hence my suggestion: you need to ensure that pyodbc is able to send the data directly as Unicode. If pyodbc fails to do this, you will get unexpected results.

I described the problem in the client-to-server direction, but the same sort of issue can arise when communicating back from the server to the client. If the client cannot understand Unicode data, you will likely run into trouble.

FreeTDS handles Unicode for you.

Actually, FreeTDS takes care of things for you and translates all the data to UCS-2 Unicode. (Source).

  • Server <--> FreeTDS : UCS2 data
  • FreeTDS <--> pyodbc : encoded strings, encoded in UTF-8 (from /etc/freetds/freetds.conf)
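The two hops listed above can be modelled conceptually in a few lines. This is only a sketch of the idea, not FreeTDS's actual implementation; the function names `client_to_server` and `server_to_client` are hypothetical, and UTF-16-LE is used as a stand-in for UCS-2, which is equivalent for characters in the Basic Multilingual Plane.

```python
def client_to_server(client_bytes, client_charset="utf-8"):
    """Convert bytes in the client charset to UCS-2 (modelled as UTF-16-LE)."""
    text = client_bytes.decode(client_charset)
    return text.encode("utf-16-le")


def server_to_client(ucs2_bytes, client_charset="utf-8"):
    """Convert UCS-2 bytes coming from the server back to the client charset."""
    text = ucs2_bytes.decode("utf-16-le")
    return text.encode(client_charset)


sent = u"som\u00e9 string".encode("utf-8")  # what the application hands over
on_wire = client_to_server(sent)            # what travels to the server
received = server_to_client(on_wire)        # what comes back to the application
```

If both hops agree on the client charset, the bytes that come back are identical to the bytes that went in. Corruption appears only when one side of a hop assumes a different charset than the other.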

So I would expect your application to work correctly if you pass UTF-8 data to pyodbc. In fact, as this django-pyodbc ticket states, django-pyodbc communicates in UTF-8 with pyodbc, so you should be fine.
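For diagnosing a broken round trip, a pair of helpers along the following lines may be useful. This is a hedged sketch, not part of pyodbc: the names `to_driver` and `from_driver` are hypothetical, and the repair step rests on the assumption that the returned string was produced by decoding UTF-8 bytes as Latin-1 somewhere in the chain. It does not call pyodbc, so whether a given pyodbc/FreeTDS build needs either step has to be checked against the actual installation.

```python
def to_driver(text, client_charset="utf-8"):
    """Encode a unicode string into the client charset before binding it."""
    return text.encode(client_charset)


def from_driver(value, client_charset="utf-8", misread_as="latin-1"):
    """Undo a wrong decode: recover the raw bytes, then decode them properly.

    Assumes the value was mangled by decoding `client_charset` bytes as
    `misread_as`. Values that do not fit that pattern are returned unchanged.
    """
    try:
        return value.encode(misread_as).decode(client_charset)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value


original = u"som\u00e9 string"
# Simulate the assumed failure mode: UTF-8 bytes decoded as Latin-1.
mangled = to_driver(original).decode("latin-1")
repaired = from_driver(mangled)
```

A repair step like this is a workaround at best. If it turns out to be necessary, that is a sign that one layer is decoding with the wrong charset, and fixing that layer's configuration is the more robust route.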

FreeTDS 0.82

However, cramm0 says that FreeTDS 0.82 is not completely bug-free, and that there are significant differences between 0.82 and the official patched 0.82 version that can be found here. You should probably try using the patched FreeTDS.


Edited: removed old data, which had nothing to do with FreeTDS and was only relevant to the Easysoft commercial ODBC driver. Sorry.

answered Sep 24 '22 04:09 by Nicolas Dumazet