Difference between bytes() and b''

I have the following str:
"\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"

This comes from a filename: Расшифровка_RootKit.com_63k.txt

My problem is that I cannot convert the first str back to the second one. I have tried a few things, using en/decode(), bytes(), etc., but did not manage to.

One thing I noticed was b'' and bytes() have different outputs:

path = "\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"
bpath = bytes(path, "UTF-8")
print(bpath.decode("UTF-8"))
print(b"\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt".decode('utf8'))

Results:

РаÑÑиÑ
         Ñовка_RootKit.com_63k.txt
Расшифровка_RootKit.com_63k.txt

So I wonder what the difference between b'' and bytes() is; maybe it will help me solve my problem!

Chocorean asked Jan 25 '23 21:01



2 Answers

You may prefer the solution that uses latin1; look at that answer first. This answer applies if you accidentally copied the contents of a bytes object and pasted them as a string.

If you want to convert them back to bytes, use the following code:

In [22]: path = "\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"

In [23]: bytes(map(ord, path)).decode('utf-8')
Out[23]: 'Расшифровка_RootKit.com_63k.txt'

The explanation is quite simple. Let's look at the first character of the string:

In [40]: '\xd0'
Out[40]: 'Ð'

In [41]: b'\xd0'
Out[41]: b'\xd0'

As you can see, a str literal turns \xd0 into the Unicode character with code point U+00D0 (Ð), while a bytes literal treats it as a single byte with value 0xd0.

UTF-8 uses the following bit pattern for all characters between U+0080 and U+07FF: 110xxxxx for the first byte and 10xxxxxx for the second byte. This is exactly what you get when you encode that one-character string to bytes as UTF-8:

In [43]: [bin(x) for x in '\xd0'.encode('utf-8')]
Out[43]: ['0b11000011', '0b10010000']

The actual code point is made of the payload bits 00011 and 010000 joined together (concatenation, not addition), which gives 0xd0:

In [44]: hex(int('00011010000', 2))
Out[44]: '0xd0'
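The bit manipulation described above can be written out by hand. This is only a rough sketch to illustrate the two-byte pattern; the function names encode_two_byte and decode_two_byte are made up for this example, and the decoder does not reject overlong sequences the way a strict UTF-8 decoder would.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point outside the two-byte UTF-8 range")
    first = 0b11000000 | (code_point >> 6)         # 110xxxxx: top 5 payload bits
    second = 0b10000000 | (code_point & 0b111111)  # 10xxxxxx: low 6 payload bits
    return bytes([first, second])


def decode_two_byte(data: bytes) -> int:
    """Recover the code point from a two-byte UTF-8 sequence."""
    first, second = data
    if first >> 5 != 0b110 or second >> 6 != 0b10:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Strip the marker bits and concatenate the payloads.
    return ((first & 0b11111) << 6) | (second & 0b111111)


encoded = encode_two_byte(0xD0)
print([bin(b) for b in encoded])      # same bit patterns as '\xd0'.encode('utf-8')
print(hex(decode_two_byte(encoded)))  # back to the original code point
```

Decoding the first two bytes of the original filename, b'\xd0\xa0', the same way should give the code point of the Cyrillic letter Р.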

To get this number from a character we can use ord:

In [45]: hex(ord('\xd0'))
Out[45]: '0xd0'

Then apply it to every character of the string and convert the result back to bytes:

In [46]: bytes(map(ord, path)).decode('utf-8')
Out[46]: 'Расшифровка_RootKit.com_63k.txt'

Note that if a character in your string does not fit in a single byte (its code point is above 255), the code above will raise an error:

In [47]: bytes([ord(chr(256))])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-49-5555e18dbece> in <module>
----> 1 bytes([ord(chr(256))])

ValueError: bytes must be in range(0, 256)
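If you would rather not get an exception, one option is to go through the latin-1 codec, which maps code points U+0000 to U+00FF one-to-one onto bytes 0 to 255 and therefore does the same job as bytes(map(ord, ...)). The helper below is just a sketch: the name fix_pasted_bytes is hypothetical, and returning the input unchanged on failure is an assumption about what you want, not the only reasonable choice.

```python
def fix_pasted_bytes(text: str, encoding: str = "utf-8") -> str:
    """Reinterpret a str whose characters are really byte values."""
    try:
        # latin-1 maps U+0000..U+00FF one-to-one onto bytes 0..255
        raw = text.encode("latin-1")
    except UnicodeEncodeError:
        # Some character is above U+00FF, so this was not pasted bytes.
        return text
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # The bytes are not valid in the target encoding; leave text alone.
        return text


path = "\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"
print(fix_pasted_bytes(path))
print(fix_pasted_bytes(chr(256)) == chr(256))  # returned unchanged instead of raising
```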
awesoon answered Jan 28 '23 11:01



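b'' is a literal prefix that makes Python interpret the quoted text as a bytes object, so each \xNN escape becomes one byte. The bytes() function, when called with a string and an encoding, encodes the characters of that string and returns the resulting bytes object.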

print(b"\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt".decode('utf8'))

This works, because you are decoding a bytes object.

path = "\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"
bpath = bytes(path, "UTF-8")
print(bpath.decode("UTF-8"))

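This does not work as intended because path is a str, so each \xNN escape in it is a character, not a byte. bytes(path, "UTF-8") encodes those characters into new bytes (two per non-ASCII character), and decoding them simply gives back the original string.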
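To see the difference concretely, here is a small sketch using only the first four escapes of the filename (shortened here for readability; the variable names are just for illustration):

```python
path = "\xd0\xa0\xd0\xb0"      # str: 4 characters, U+00D0 U+00A0 U+00D0 U+00B0
literal = b"\xd0\xa0\xd0\xb0"  # bytes: 4 bytes, the UTF-8 encoding of two Cyrillic letters

bpath = bytes(path, "UTF-8")   # same result as path.encode("UTF-8")
print(bpath)                   # each character became two bytes
print(len(path), len(literal), len(bpath))

round_trip = bpath.decode("UTF-8")
print(round_trip == path)      # encoding then decoding returns the same str

print(literal.decode("UTF-8")) # only the bytes literal decodes to Cyrillic text
```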

3ch0 answered Jan 28 '23 10:01
