
String encode/decode issue - missing character from end

I have an NVARCHAR column in my database, and I am unable to convert its content to a plain string in my code. (I am using pyodbc for the database connection.)

# This unicode string is returned by the database
>>> my_string = u'\u4157\u4347\u6e65\u6574\u2d72\u3430\u3931\u3530\u3731\u3539\u3533\u3631\u3630\u3530\u3330\u322d\u3130\u3036\u3036\u3135\u3432\u3538\u2d37\u3134\u3039\u352d'

# prints what looks like Chinese text
>>> print my_string
䅗䍇湥整⵲㐰㤱㔰㜱㔹㔳㘱㘰㔰㌰㈭㄰〶〶ㄵ㐲㔸ⴷㄴ〹㔭

The closest I have come is by encoding it as utf-16:

>>> my_string.encode('utf-16')
'\xff\xfeWAGCenter-04190517953516060503-20160605124857-4190-5'
>>> print my_string.encode('utf-16')
��WAGCenter-04190517953516060503-20160605124857-4190-5

But the value I actually need, as stored in the database, is:

WAGCenter-04190517953516060503-20160605124857-4190-51

I tried encoding it as utf-8, utf-16, ascii, and utf-32, but nothing worked.

Does anyone have an idea what I am missing, and how to get the desired result from my_string?

Edit: After encoding it as utf-16-le, the unwanted characters at the start are gone, but one character is still missing from the end:

>>> print my_string.encode('utf-16-le')
WAGCenter-04190517953516060503-20160605124857-4190-5

It works for some other columns. What might be causing this intermittent issue?

user7001260 asked Oct 18 '22 00:10


2 Answers

You have a major problem in your database definition, in the way you store values in it, or in the way you read values from it. I can explain what you are seeing, but I cannot say why it happens or how to fix it without knowing:

  • the type of the database
  • the way you input values in it
  • the way you extract values to obtain your pseudo unicode string
  • the actual content if you use direct (native) database access

What you get is an ASCII string whose 8-bit characters have been grouped in pairs to build 16-bit Unicode characters in little-endian order. Because the expected string has an odd number of characters (53), the last character was irretrievably lost in translation: the string you received ends with u'\u352d', where 0x2d is the ASCII code for '-' and 0x35 the code for '5', so nothing remains of the final '1'. Demo:

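def cvt(ustring):
    # Split each 16-bit character back into its two bytes, little endian.
    chars = []
    for uc in ustring:
        chars.append(chr(ord(uc) & 0xFF))         # low-order byte comes first
        chars.append(chr((ord(uc) >> 8) & 0xFF))  # then the high-order byte
    return ''.join(chars)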

cvt(my_string)
'WAGCenter-04190517953516060503-20160605124857-4190-5'
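
The same effect can be reproduced in the other direction. This is only a sketch in Python 3 syntax (the demo above is Python 2), and it assumes the stored value is pure ASCII as shown in the question; the variable names are illustrative, and the slicing step merely simulates whatever the driver layer actually did.

```python
import codecs

# The value stored in the database, per the question: 53 ASCII characters.
expected = "WAGCenter-04190517953516060503-20160605124857-4190-51"
raw = expected.encode("ascii")

# Simulate the faulty read: only complete 2-byte pairs can become UTF-16
# code units, so the trailing odd byte (the final '1') has no partner.
paired = raw[: len(raw) - len(raw) % 2]
garbled = paired.decode("utf-16-le")

# Reversing the mistake recovers everything except the dropped byte.
recovered = garbled.encode("utf-16-le").decode("ascii")

# Plain 'utf-16' prepends a byte order mark; on a little-endian machine
# that is the '\xff\xfe' prefix seen in the question.
with_bom = codecs.BOM_UTF16_LE + garbled.encode("utf-16-le")

print(len(raw), len(garbled))  # 53 26
print(recovered)               # WAGCenter-04190517953516060503-20160605124857-4190-5
```

Values with an even number of characters survive this round trip intact, which would explain why other columns appear to work.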

Serge Ballesta answered Oct 21 '22 09:10


The issue was that my odbcinst.ini file specified UTF-16 as the character encoding, whereas it needed to be UTF-8.

Earlier I had been setting the encoding through the OPTION parameter when connecting with pyodbc. Changing it in the odbcinst.ini file instead is what fixed the issue.

user7001260 answered Oct 21 '22 10:10