 

Struggling with utf-16 encoding/decoding

Tags:

python

utf-16

I'm parsing a document that contains some UTF-16 encoded strings.

I have a byte string that contains the following:

my_var = b'\xc3\xbe\xc3\xbf\x004\x004\x000\x003\x006\x006\x000\x006\x00-\x001\x000\x000\x003\x008\x000\x006\x002\x002\x008\x005'

When decoding it as UTF-8, I get the following output:

print(my_var.decode('utf-8'))
#> þÿ44036606-10038062285

The first two characters, þÿ, look like the BOM for UTF-16BE, as listed on Wikipedia.

What I don't understand is what happens when I test for the UTF-16 BOM like this:

if my_var.startswith(codecs.BOM_UTF16_BE):

This evaluates to False. In fact, printing codecs.BOM_UTF16_BE doesn't show the same bytes:

print(codecs.BOM_UTF16_BE)
#> b'\xfe\xff'

Why is that? I suspect an encoding issue further upstream, but I'm not sure how to fix it.

There are already a few answers about decoding UTF-16 on Stack Overflow (like this one), and they all say the same thing: decode using utf-16 and Python will handle the BOM.

... But that doesn't work for me.

print(my_var.decode('utf-16'))
#> 뻃뿃㐀㐀 ㌀㘀㘀 㘀ⴀ㄀  ㌀㠀 㘀㈀㈀㠀㔀

But with UTF-16BE:

print(my_var.decode('utf-16be'))
#> 쎾쎿44036606-10038062285

(the BOM is not removed)

And with UTF-16LE:

print(my_var.decode('utf-16le'))
#> 뻃뿃㐀㐀 ㌀㘀㘀 㘀ⴀ㄀  ㌀㠀 㘀㈀㈀㠀㔀

So, for a reason I can't explain, using only .decode('UTF-16') doesn't work for me. Why?

UPDATE

The original source string isn't the one I mentioned, but this one:

source = rb'\376\377\0004\0004\0000\0003\0006\0006\0000\0006\000-\0001\0000\0000\0003\0008\0000\0006\0002\0002\0008\0005'

I converted it using the following:

def decode_8bit(match):
    # Strip the backslash, then parse the remaining digits as an octal number.
    value = match.group().replace(b'\\', b'')
    return chr(int(value, base=8)).encode('utf-8')

my_var = re.sub(b'\\\\[0-9]{1,3}', decode_8bit, source)

Maybe I did something wrong here?

Cyril N. asked Nov 08 '22 02:11


1 Answer

You are right that þÿ is what the UTF-16BE BOM looks like, but only when the bytes 0xFE 0xFF are displayed using the CP1252 encoding.

The difference is the following:

Your first byte is 0xC3, which is 11000011 in binary.

  • UTF-8:

The leading bits 110 mark the start of a 2-byte UTF-8 sequence, so 0xC3 0xBE are read together as your first character, which is þ in UTF-8.

  • CP1252

CP1252 always uses 1 byte per character and returns Ã for 0xC3.

But if you look up 0xC3 in the BOM list you linked, you won't find any matching encoding. It looks like your byte string never started with a BOM in the first place.
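To make the difference concrete, here is a small sketch using only the standard library. The name mangled is made up for illustration; it reproduces what happens when each BOM byte is passed through chr(...).encode('utf-8').

```python
import codecs

bom = codecs.BOM_UTF16_BE                # b'\xfe\xff'

# Displayed as CP1252, the two BOM bytes look like "þÿ".
print(bom.decode('cp1252'))              # þÿ

# Treating each byte value as a code point and encoding it as UTF-8
# produces two bytes per character instead of one.
mangled = ''.join(chr(b) for b in bom).encode('utf-8')
print(mangled)                           # b'\xc3\xbe\xc3\xbf'

# The result still prints as "þÿ" when decoded as UTF-8,
# but it no longer starts with the real BOM bytes.
print(mangled.decode('utf-8'))           # þÿ
print(mangled.startswith(bom))           # False
```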

Using the default encoding is probably the way to go, which is UTF-16LE for Windows.

Edit after original source added

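Your encoding to UTF-8 destroys the BOM: the bytes 0xFE and 0xFF never occur in valid UTF-8, so each one is replaced by a two-byte sequence (0xC3 0xBE and 0xC3 0xBF). Try to avoid the detour through text and pass on a list of bytes instead.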
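If you only have the already-converted my_var, the damage can be undone. This sketch assumes every non-ASCII byte in it came from chr(...).encode('utf-8') on a value below 256, which holds for the string in the question. The name raw is just for illustration.

```python
import codecs

my_var = b'\xc3\xbe\xc3\xbf\x004\x004\x000\x003\x006\x006\x000\x006\x00-\x001\x000\x000\x003\x008\x000\x006\x002\x002\x008\x005'

# Undo the accidental UTF-8 encoding: under Latin-1, every code point
# below 256 maps back to exactly one byte with the same value.
raw = my_var.decode('utf-8').encode('latin-1')

print(raw.startswith(codecs.BOM_UTF16_BE))  # True
print(raw.decode('utf-16'))                 # 44036606-10038062285
```

With the real BOM restored, the plain utf-16 codec detects the byte order and strips the BOM on its own.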

OP's solution:

bytes([int(value, base=8)])
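Put together, the conversion could look like the sketch below. It assumes the source is a bytes object containing literal backslash-octal escapes (hence the rb prefix) and that every escape is at most 0o377. The name decode_octal is hypothetical, and the pattern uses [0-7] rather than the [0-9] from the question, since only those digits are valid in octal. The integer is wrapped in a list because bytes(n) with a bare integer returns n zero bytes rather than a single byte of value n.

```python
import codecs
import re

def decode_octal(match):
    # match.group() looks like b'\\376'; drop the backslash and
    # parse the remaining digits as an octal byte value.
    return bytes([int(match.group()[1:], 8)])

source = rb'\376\377\0004\0004\0000\0003\0006\0006\0000\0006\000-\0001\0000\0000\0003\0008\0000\0006\0002\0002\0008\0005'

# Replace each escape with the single raw byte it stands for.
raw = re.sub(rb'\\[0-7]{1,3}', decode_octal, source)

print(raw.startswith(codecs.BOM_UTF16_BE))  # True
print(raw.decode('utf-16'))                 # 44036606-10038062285
```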
Hyarus answered Nov 15 '22 12:11