I can read from a MSSQL database by sending queries in python through pypyodbc.
Mostly unicode characters are handled correctly, but I've hit a certain character that causes an error.
The field in question is of type nvarchar(50)
and begins with this character "" which renders for me a bit like this...
-----
|100|
|111|
-----
If that number is hex 0x100111
then it's the character supplementary private use area-b u+100111
. Though interestingly, if it's binary 0b100111
then it's an apostrophe, could it be that the wrong encoding was used when the data was uploaded? This field is storing part of a Chinese postal address.
The error message includes
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data
Here it is in full...
Traceback (most recent call last): File "question.py", line 19, in <module>
results.fetchone() File "/VIRTUAL_ENVIRONMENT_DIR/local/lib/python2.7/site-packages/pypyodbc.py", line 1869, in fetchone
value_list.append(buf_cvt_func(from_buffer_u(alloc_buffer))) File "/VIRTUAL_ENVIRONMENT_DIR/local/lib/python2.7/site-packages/pypyodbc.py", line 482, in UCS_dec
uchar = buffer.raw[i:i + ucs_length].decode(odbc_decoding) File "/VIRTUAL_ENVIRONMENT_DIR/lib/python2.7/encodings/utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True) UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data
Here's some minimal reproducing code...
import pypyodbc
connection_string = (
"DSN=sqlserverdatasource;"
"UID=REDACTED;"
"PWD=REDACTED;"
"DATABASE=obi_load")
connection = pypyodbc.connect(connection_string)
cursor = connection.cursor()
query_sql = (
"SELECT address_line_1 "
"FROM address "
"WHERE address_id == 'REDACTED' ")
with cursor.execute(query_sql) as results:
row = results.fetchone() # This is the line that raises the error.
print row
Here is a chunk of my /etc/freetds/freetds.conf
[global]
; tds version = 4.2
; dump file = /tmp/freetds.log
; debug flags = 0xffff
; timeout = 10
; connect timeout = 10
text size = 64512
[sqlserver]
host = REDACTED
port = 1433
tds version = 7.0
client charset = UTF-8
I've also tried with client charset = UTF-16
and omitting that line all together.
Here's the relevant chunk from my /etc/odbc.ini
[sqlserverdatasource]
Driver = FreeTDS
Description = ODBC connection via FreeTDS
Trace = No
Servername = sqlserver
Database = REDACTED
Here's the relevant chunk from my /etc/odbcinst.ini
[FreeTDS]
Description = TDS Driver (Sybase/MS SQL)
Driver = /usr/lib/x86_64-linux-gnu/odbc/libtdsodbc.so
Setup = /usr/lib/x86_64-linux-gnu/odbc/libtdsS.so
CPTimeout =
CPReuse =
UsageCount = 1
I can work around this issue by fetching results in a try/except block, throwing away any rows that raise a UnicodeDecodeError
, but is there a solution? Can I throw away just the undecodable character, or is there a way to fetch this line without raising an error?
It's not inconceivable that some bad data has ended up on the database.
I've Googled around and checked this site's related questions, but have had no luck.
I fixed the issue myself by using this:
conn.setencoding('utf-8')
immediately before creating a cursor.
Where conn
is the connection object.
I was fetching tens of millions of rows with fetchall()
, and in the middle of a transaction that would be extremely expensive to undo manually, so I couldn't afford to simply skip invalid ones.
Source where I found the solution: https://github.com/mkleehammer/pyodbc/issues/112#issuecomment-264734456
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With