String encode/decode issue - missing character from end

Question

I have an NVARCHAR column in my database, and I am unable to convert its content to a plain string in my code. (I am using pyodbc for the database connection.)

# This unicode string is returned by the database
>>> my_string = u'\u4157\u4347\u6e65\u6574\u2d72\u3430\u3931\u3530\u3731\u3539\u3533\u3631\u3630\u3530\u3330\u322d\u3130\u3036\u3036\u3135\u3432\u3538\u2d37\u3134\u3039\u352d'

# prints something in Chinese
>>> print my_string
䅗䍇湥整⵲㐰㤱㔰㜱㔹㔳㘱㘰㔰㌰㈭㄰〶〶ㄵ㐲㔸ⴷㄴ〹㔭

The closest I have come is encoding it as utf-16:

>>> my_string.encode('utf-16')
'\xff\xfeWAGCenter-04190517953516060503-20160605124857-4190-5'
>>> print my_string.encode('utf-16')
��WAGCenter-04190517953516060503-20160605124857-4190-5

But the actual value that I need, as stored in the database, is:

WAGCenter-04190517953516060503-20160605124857-4190-51

I tried encoding it as utf-8, utf-16, ascii, and utf-32, but nothing worked.

Does anyone have an idea what I am missing, and how to get the desired result from my_string?

Edit: On encoding it as utf-16-le, I am able to remove the unwanted characters from the start, but one character is still missing from the end:

>>> print my_string.encode('utf-16-le')
WAGCenter-04190517953516060503-20160605124857-4190-5
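
For context, here is a minimal sketch of why 'utf-16' and 'utf-16-le' differ only at the start of the output. It is written for Python 3 (the session above is Python 2.7), and `text` is simply the first two code units of my_string, chosen for illustration. The plain 'utf-16' codec prepends a byte-order mark (BOM) in the machine's native byte order, while 'utf-16-le' fixes the byte order and writes no BOM.

```python
import codecs

# First two code units of my_string (illustrative subset).
text = "\u4157\u4347"

# 'utf-16' uses the native byte order and prepends a two-byte BOM.
with_bom = text.encode("utf-16")

# 'utf-16-le' is explicitly little endian and writes no BOM.
without_bom = text.encode("utf-16-le")

print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True
print(without_bom)  # b'WAGC'
```

The two extra bytes are what show up as unprintable characters in the earlier output. They have nothing to do with the character missing at the end.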

It works for some other columns. What might be the cause of this intermittent issue?

Serge Ballesta · Accepted Answer

You have a major problem in your database definition, in the way you store values in it, or in the way you read values from it. I can explain what you are seeing, but I cannot say why it happens or how to fix it without knowing:

the type of the database
the way you input values in it
the way you extract values to obtain your pseudo unicode string
the actual content if you use direct (native) database access

What you get is an ASCII string whose 8-bit characters have been grouped in pairs to build 16-bit Unicode characters in little-endian order. Because the expected string has an odd number of characters, the last character was irretrievably lost in translation: the string you received ends with u'\u352d', where 0x2d is the ASCII code for '-' and 0x35 the code for '5', so nothing remains of the final '1'. Demo:

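def cvt(ustring):
    # Python 2: split each 16-bit code unit back into its two 8-bit characters
    chars = []
    for uc in ustring:
        chars.append(chr(ord(uc) & 0xFF))         # low-order byte first (little endian)
        chars.append(chr((ord(uc) >> 8) & 0xFF))  # then the high-order byte
    return ''.join(chars)

>>> cvt(my_string)
'WAGCenter-04190517953516060503-20160605124857-4190-5'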
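
The reverse direction can be sketched as well, to show where the final character goes. This is only an illustration of the pairing described above, not a description of what the driver actually does internally. It is written for Python 3, and the helper name misread_as_utf16le is made up for this example.

```python
# The value as stored in the database: 53 single-byte characters.
expected = b"WAGCenter-04190517953516060503-20160605124857-4190-51"

def misread_as_utf16le(raw):
    """Decode single-byte data as UTF-16-LE, dropping any unpaired final byte."""
    usable = len(raw) - (len(raw) % 2)  # largest even length
    return raw[:usable].decode("utf-16-le")

# 53 bytes give 26 complete pairs; the 53rd byte has no partner and is dropped.
garbled = misread_as_utf16le(expected)

# Encoding back to UTF-16-LE recovers only the 52 bytes that were paired.
recovered = garbled.encode("utf-16-le")

print(len(expected), len(garbled), len(recovered))  # 53 26 52
print(recovered.decode("ascii"))
```

The 26 code units in `garbled` are the same ones shown in my_string, so no choice of codec on the Python side can bring back the trailing '1'. The data has to be read correctly at the driver level.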

user7001260 · Answer

The issue was that I was using UTF-16 as the character encoding in my odbcinst.ini file, whereas I needed to use UTF-8.

Earlier I had been changing it through an OPTION parameter when making the connection with pyodbc, which did not help. Changing it in the odbcinst.ini file fixed the issue.

Tags: python, encode, python-2.7, pyodbc, netezza
