I'm parsing a document that contains some UTF-16 encoded strings.
I have a byte string that contains the following:
my_var = b'\xc3\xbe\xc3\xbf\x004\x004\x000\x003\x006\x006\x000\x006\x00-\x001\x000\x000\x003\x008\x000\x006\x002\x002\x008\x005'
When decoding it as UTF-8, I get the following output:
print(my_var.decode('utf-8'))
#> þÿ44036606-10038062285
The first two characters, þÿ, look like the BOM for UTF-16BE, as listed on Wikipedia.
But what I don't understand is that if I test for the UTF-16 BOM like this:
if value.startswith(codecs.BOM_UTF16_BE)
This returns False. In fact, printing codecs.BOM_UTF16_BE
doesn't show the same results:
print(codecs.BOM_UTF16_BE)
#> b'\xfe\xff'
Why is that? I suspect an encoding issue further upstream, but I'm not sure how to fix it.
There are already a few mentions of how to decode UTF-16 on Stack Overflow (like this one), and they all say one thing: Decode using utf-16
and Python will handle the BOM.
... But that doesn't work for me.
print(my_var.decode('utf-16'))
#> 뻃뿃㐀㐀 ㌀㘀㘀 㘀ⴀ ㌀㠀 㘀㈀㈀㠀㔀
But with UTF-16BE:
print(my_var.decode('utf-16be'))
#> 쎾쎿44036606-10038062285
(the BOM is not removed)
And with UTF-16LE:
print(my_var.decode('utf-16le'))
#> 뻃뿃㐀㐀 ㌀㘀㘀 㘀ⴀ ㌀㠀 㘀㈀㈀㠀㔀
So, for a reason I can't explain, using only .decode('UTF-16')
doesn't work for me. Why?
UPDATE
The original source string isn't the one I mentioned, but this one:
source = '\376\377\0004\0004\0000\0003\0006\0006\0000\0006\000-\0001\0000\0000\0003\0008\0000\0006\0002\0002\0008\0005'
I converted it using the following:
def decode_8bit(cls, match):
    value = match.group().replace(b'\\', b'')
    return chr(int(value, base=8)).encode('utf-8')
my_var = re.sub(b'\\\\[0-9]{1,3}', decode_8bit, source)
Maybe I did something wrong here?
It is true that þÿ is what the UTF-16BE BOM looks like, if you read its bytes using the CP1252 encoding.
The difference is the following:
Your first byte is 0xC3, which is 11000011 in binary.
The first two bits are set and the third is clear (110xxxxx), which marks the start of a 2-byte UTF-8 sequence. Your first character is therefore 0xC3 0xBE, which is þ in UTF-8.
CP1252 always uses 1 byte per character and returns Ã for 0xC3.
But if you look up 0xC3 in your linked BOM list, you won't find any matching encoding. It looks like there wasn't a BOM in the first place.
Falling back to the default encoding, which is UTF-16LE on Windows, is probably the way to go.
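The byte-level difference described above can be checked directly. This is only a small sketch; the variable names (first_bytes, as_utf8, as_cp1252, bom_as_cp1252) are made up for the illustration and are not part of the question's code.

```python
import codecs

# The first four bytes of my_var from the question.
first_bytes = b'\xc3\xbe\xc3\xbf'

# Read as UTF-8, they form two 2-byte sequences: U+00FE and U+00FF.
as_utf8 = first_bytes.decode('utf-8')                   # 'þÿ'
# Read as CP1252, every byte is its own character.
as_cp1252 = first_bytes.decode('cp1252')                # 'Ã¾Ã¿'
# The real BOM (FE FF) only looks like 'þÿ' when read as CP1252.
bom_as_cp1252 = codecs.BOM_UTF16_BE.decode('cp1252')    # 'þÿ'

print(len(as_utf8), len(as_cp1252))                     # 2 4
print(as_utf8 == bom_as_cp1252)                         # True
print(first_bytes.startswith(codecs.BOM_UTF16_BE))      # False
```

So the text looks identical on screen, but the underlying bytes are C3 BE C3 BF rather than FE FF, which is why the startswith check fails.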
Edit after original source added
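Your encoding to UTF-8 destroys the BOM, because the BOM bytes 0xFE 0xFF are not valid UTF-8: chr(...).encode('utf-8') turns each of them into a two-byte sequence (0xC3 0xBE and 0xC3 0xBF). Try to avoid the round trip through text and pass on a list of bytes.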
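As a rough illustration of what happens to a single escape such as \376, and of how data that has already been damaged this way could be repaired, consider the sketch below. It assumes that every damaged character came from a value in the range 0-255, which holds for my_var from the question; the names damaged, intact, repaired and text are introduced only for this example.

```python
import codecs

# What chr(int(value, base=8)).encode('utf-8') produces for \376:
damaged = chr(0o376).encode('utf-8')
print(damaged)   # b'\xc3\xbe'

# Keeping the value as one raw byte preserves it:
intact = bytes([0o376])
print(intact)    # b'\xfe'

# Repairing the already-damaged byte string from the question:
my_var = b'\xc3\xbe\xc3\xbf\x004\x004\x000\x003\x006\x006\x000\x006\x00-\x001\x000\x000\x003\x008\x000\x006\x002\x002\x008\x005'

# Decode the accidental UTF-8 layer, then map each code point 0-255
# back to a single byte via Latin-1.
repaired = my_var.decode('utf-8').encode('latin-1')
print(repaired.startswith(codecs.BOM_UTF16_BE))   # True

# With a real BOM in place, plain 'utf-16' detects the byte order and strips it.
text = repaired.decode('utf-16')
print(text)      # 44036606-10038062285
```

The repair step is a workaround for existing data; fixing the conversion itself so that it never encodes to UTF-8 is the cleaner route.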
OP's solution:
bytes([int(value, base=8)])
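Put together, the whole conversion might look like the sketch below. It assumes the original source is a bytes object containing literal backslash-octal escapes (written here as a raw bytes literal), drops the cls parameter, and narrows the pattern to octal digits [0-7]; the function name decode_octal_escape and the variables raw and text are hypothetical. Note that bytes() needs a list here: bytes([n]) is one byte with value n, while bytes(n) is n zero bytes.

```python
import codecs
import re

# Assumption: the input holds literal backslashes followed by octal digits.
source = rb'\376\377\0004\0004\0000\0003\0006\0006\0000\0006\000-\0001\0000\0000\0003\0008\0000\0006\0002\0002\0008\0005'

def decode_octal_escape(match):
    # match.group() is e.g. b'\\376'; drop the backslash and parse base 8.
    value = match.group()[1:]
    # Wrap the integer in a list to get exactly one byte with that value.
    return bytes([int(value, base=8)])

# Replace every escape with its raw byte; no UTF-8 encoding involved.
raw = re.sub(rb'\\[0-7]{1,3}', decode_octal_escape, source)

print(raw.startswith(codecs.BOM_UTF16_BE))   # True
text = raw.decode('utf-16')                  # BOM is consumed by the codec
print(text)                                  # 44036606-10038062285
```

Because the BOM bytes FE FF now survive intact, decoding with plain 'utf-16' works as the other Stack Overflow answers describe.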