Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's the cause of this UnicodeDecodeError with an nvarchar field using pyodbc and MSSQL?

I can read from a MSSQL database by sending queries in python through pypyodbc.

Mostly unicode characters are handled correctly, but I've hit a certain character that causes an error.

The field in question is of type nvarchar(50) and begins with this character "􀄑" which renders for me a bit like this...

-----
|100|
|111| 
-----

If that number is hex 0x100111 then it's the character supplementary private use area-b u+100111. Though interestingly, if it's binary 0b100111 then it's an apostrophe, could it be that the wrong encoding was used when the data was uploaded? This field is storing part of a Chinese postal address.

The error message includes

UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data

Here it is in full...

Traceback (most recent call last):   File "question.py", line 19, in <module>
    results.fetchone()   File "/VIRTUAL_ENVIRONMENT_DIR/local/lib/python2.7/site-packages/pypyodbc.py", line 1869, in fetchone
    value_list.append(buf_cvt_func(from_buffer_u(alloc_buffer)))   File "/VIRTUAL_ENVIRONMENT_DIR/local/lib/python2.7/site-packages/pypyodbc.py", line 482, in UCS_dec
    uchar = buffer.raw[i:i + ucs_length].decode(odbc_decoding)   File "/VIRTUAL_ENVIRONMENT_DIR/lib/python2.7/encodings/utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True) UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data

Here's some minimal reproducing code...

import pypyodbc

connection_string = (
    "DSN=sqlserverdatasource;"
    "UID=REDACTED;"
    "PWD=REDACTED;"
    "DATABASE=obi_load")

connection = pypyodbc.connect(connection_string)

cursor = connection.cursor()

query_sql = (
    "SELECT address_line_1 "
    "FROM address "
    "WHERE address_id == 'REDACTED' ")

with cursor.execute(query_sql) as results:
    row = results.fetchone() # This is the line that raises the error.
    print row

Here is a chunk of my /etc/freetds/freetds.conf

[global]
;   tds version = 4.2
;   dump file = /tmp/freetds.log
;   debug flags = 0xffff
;   timeout = 10
;   connect timeout = 10
    text size = 64512

[sqlserver]
host = REDACTED
port = 1433
tds version = 7.0
client charset = UTF-8

I've also tried with client charset = UTF-16 and omitting that line all together.

Here's the relevant chunk from my /etc/odbc.ini

[sqlserverdatasource]
Driver = FreeTDS
Description = ODBC connection via FreeTDS
Trace = No
Servername = sqlserver
Database = REDACTED

Here's the relevant chunk from my /etc/odbcinst.ini

[FreeTDS]
Description = TDS Driver (Sybase/MS SQL)
Driver = /usr/lib/x86_64-linux-gnu/odbc/libtdsodbc.so
Setup = /usr/lib/x86_64-linux-gnu/odbc/libtdsS.so
CPTimeout =
CPReuse =
UsageCount = 1

I can work around this issue by fetching results in a try/except block, throwing away any rows that raise a UnicodeDecodeError, but is there a solution? Can I throw away just the undecodable character, or is there a way to fetch this line without raising an error?

It's not inconceivable that some bad data has ended up on the database.

I've Googled around and checked this site's related questions, but have had no luck.

like image 571
vowel-house-might Avatar asked Apr 18 '16 12:04

vowel-house-might


1 Answers

I fixed the issue myself by using this:

conn.setencoding('utf-8')

immediately before creating a cursor.

Where conn is the connection object.

I was fetching tens of millions of rows with fetchall(), and in the middle of a transaction that would be extremely expensive to undo manually, so I couldn't afford to simply skip invalid ones.

Source where I found the solution: https://github.com/mkleehammer/pyodbc/issues/112#issuecomment-264734456

like image 80
Max Candocia Avatar answered Nov 08 '22 17:11

Max Candocia