Struggling with utf-16 encoding/decoding

Question

I'm parsing a document that contains some UTF-16 encoded strings.

I have a byte string that contains the following:

my_var = b'\xc3\xbe\xc3\xbf\x004\x004\x000\x003\x006\x006\x000\x006\x00-\x001\x000\x000\x003\x008\x000\x006\x002\x002\x008\x005'

When converting to utf-8, I get the following output:

print(my_var.decode('utf-8'))
#> þÿ44036606-10038062285

The first two characters, þÿ, look like the BOM for UTF-16BE, as listed on Wikipedia.

What I don't understand is what happens when I test for the UTF-16 BOM like this:

if value.startswith(codecs.BOM_UTF16_BE):

The check evaluates to False. In fact, printing codecs.BOM_UTF16_BE doesn't show the same bytes:

print(codecs.BOM_UTF16_BE)
#> b'\xfe\xff'

Why is that? I suspect an encoding issue somewhere upstream, but I'm not sure how to fix it.

There are already a few answers about decoding UTF-16 on Stack Overflow (like this one), and they all say the same thing: decode using utf-16 and Python will handle the BOM.

... But that doesn't work for me.

print(my_var.decode('utf-16'))
#> 뻃뿃㐀㐀　㌀㘀㘀　㘀ⴀ㄀　　㌀㠀　㘀㈀㈀㠀㔀

But with UTF-16BE:

print(my_var.decode('utf-16be'))
#> 쎾쎿44036606-10038062285

(the BOM is not removed)

And with UTF-16LE:

print(my_var.decode('utf-16le'))
#> 뻃뿃㐀㐀　㌀㘀㘀　㘀ⴀ㄀　　㌀㠀　㘀㈀㈀㠀㔀

So, for a reason I can't explain, using only .decode('UTF-16') doesn't work for me. Why?

UPDATE

The original source string isn't the one I mentioned, but this one:

source = '\376\377\0004\0004\0000\0003\0006\0006\0000\0006\000-\0001\0000\0000\0003\0008\0000\0006\0002\0002\0008\0005'

I converted it using the following:

def decode_8bit(cls, match):
    value = match.group().replace(b'\\', b'')
    return chr(int(value, base=8)).encode('utf-8')

my_var = re.sub(b'\\\\[0-9]{1,3}', decode_8bit, source)

Maybe I did something wrong here?

Hyarus · Accepted Answer

It is true that þÿ is how the BOM for UTF-16BE looks, but only when its bytes (0xFE 0xFF) are displayed using the CP1252 encoding.

The difference is the following:

Your first byte is 0xC3, which is 11000011 in binary.

UTF-8:

The leading bits 110 indicate that this UTF-8 character is 2 bytes long. Your first character is therefore 0xC3 0xBE, which decodes to þ in UTF-8.
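To make the bit pattern concrete, here is a small sketch that reads the sequence length off a lead byte. The helper name `utf8_sequence_length` is made up for this illustration and is not a standard library function.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the UTF-8 sequence length announced by a lead byte (hypothetical helper)."""
    if lead_byte & 0b10000000 == 0:
        return 1  # 0xxxxxxx: single-byte (ASCII)
    if lead_byte & 0b11100000 == 0b11000000:
        return 2  # 110xxxxx: two-byte sequence
    if lead_byte & 0b11110000 == 0b11100000:
        return 3  # 1110xxxx: three-byte sequence
    if lead_byte & 0b11111000 == 0b11110000:
        return 4  # 11110xxx: four-byte sequence
    raise ValueError("not a valid UTF-8 lead byte")


first = 0xC3
print(format(first, "08b"))         # 11000011
print(utf8_sequence_length(first))  # 2
print(b"\xc3\xbe".decode("utf-8"))  # þ
```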

CP1252

CP1252 always uses 1 byte per character and maps 0xC3 to Ã.

But if you look up 0xC3 in the BOM list you linked, you won't find any matching encoding. It looks like there wasn't a BOM in the first place.
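The comparison can be checked directly in Python. This is only a sketch using the first few bytes of the value from the question; it shows that neither UTF-16 BOM matches, and how the same leading bytes read under each encoding.

```python
import codecs

my_var = b'\xc3\xbe\xc3\xbf\x004\x004'  # first bytes of the value from the question

# Neither UTF-16 BOM matches, because the data starts with 0xC3.
print(my_var.startswith(codecs.BOM_UTF16_BE))  # False
print(my_var.startswith(codecs.BOM_UTF16_LE))  # False

# The same two leading bytes, read under both encodings:
print(my_var[:2].decode('utf-8'))   # þ
print(my_var[:2].decode('cp1252'))  # Ã¾

# The real BOM bytes, read as CP1252, give the þÿ shown in BOM tables.
print(codecs.BOM_UTF16_BE.decode('cp1252'))  # þÿ
```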

~~Using the default encoding is probably the way to go, which is UTF-16LE for Windows.~~

Edit after original source added

Your encoding to UTF-8 destroys the BOM, because the BOM bytes 0xFE 0xFF are not valid UTF-8 on their own: each one gets re-encoded as a 2-byte sequence. Avoid the encode step and pass the raw bytes along instead.
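A short sketch of what happens to the first escape, \376. The last part is an addition not taken from the question: a possible repair for data that has already been mangled, which assumes every decoded character is below U+0100 so that Latin-1 maps it back to a single byte.

```python
value = b'376'  # octal digits of the first escape, \376

code_point = int(value, base=8)  # 254 == 0xFE

# What the question's code does: treat 254 as a character and encode it as UTF-8.
print(chr(code_point).encode('utf-8'))  # b'\xc3\xbe'

# Keeping the value as a single raw byte preserves the BOM byte.
print(bytes([code_point]))  # b'\xfe'

# Possible repair for already-mangled data (assumes all code points < 256):
mangled = b'\xc3\xbe\xc3\xbf\x004\x004'
repaired = mangled.decode('utf-8').encode('latin-1')
print(repaired[:2])               # b'\xfe\xff'
print(repaired.decode('utf-16'))  # 44
```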

OP's solution:

bytes([int(value, base=8)])
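Put together, the whole conversion might look like the sketch below. It assumes the source holds the backslash-octal escapes as literal text in a bytes object; the function name `decode_octal_escape` is made up here, and the pattern uses [0-7] rather than the question's [0-9] since only those digits are valid in octal.

```python
import re

# Assumed input: the escapes as literal text (backslash + octal digits), as bytes.
source = rb'\376\377\0004\0004\0000\0003\0006\0006\0000\0006\000-\0001\0000\0000\0003\0008\0000\0006\0002\0002\0008\0005'


def decode_octal_escape(match):
    """Turn one escape such as \\376 into the single raw byte it names."""
    digits = match.group()[1:]  # drop the leading backslash
    return bytes([int(digits, base=8)])


raw = re.sub(rb'\\[0-7]{1,3}', decode_octal_escape, source)

print(raw[:2])               # b'\xfe\xff'
print(raw.decode('utf-16'))  # 44036606-10038062285
```

Because the BOM survives as the raw bytes 0xFE 0xFF, plain .decode('utf-16') detects the byte order and strips the BOM.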

Tags:

python

utf-16

Cyril N.