I have the following str:
"\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"
This comes from a filename: Расшифровка_RootKit.com_63k.txt
My problem is that I cannot convert the first str back to the second one. I have tried a few things, using encode()/decode(), bytes(), etc., but I did not manage.
One thing I noticed was b'' and bytes() have different outputs:
path = "\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"
bpath = bytes(path, "UTF-8")
print(bpath.decode("UTF-8"))
print(b"\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt".decode('utf8'))
Results:
РаÑÑиÑ
Ñовка_RootKit.com_63k.txt
Расшифровка_RootKit.com_63k.txt
So I wonder what the difference is between b'' and bytes(); maybe it will help me solve my problem!
You may want to use the solution with latin1; scroll to that answer first. This answer works if you accidentally copied the contents of a bytes object and pasted them as a string.
To convert the characters of such a string back to bytes, use the following code:
In [22]: path = "\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"
In [23]: bytes(map(ord, path)).decode('utf-8')
Out[23]: 'Расшифровка_RootKit.com_63k.txt'
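The latin1 approach mentioned earlier performs the same conversion. Latin-1 maps code points U+0000 through U+00FF one-to-one onto byte values 0 through 255, so encoding with it gives the same bytes as bytes(map(ord, path)). A minimal sketch, where the variable name fixed is only illustrative:

```python
path = "\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"

# Latin-1 maps every code point below 256 to the byte with the same value,
# so this step recovers the raw UTF-8 bytes hidden in the str.
raw = path.encode("latin-1")

# Decoding those bytes as UTF-8 yields the real filename.
fixed = raw.decode("utf-8")
print(fixed)
```

Like the ord-based version, this raises an exception if the string contains a character above U+00FF.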
The explanation is quite simple. Let's use the first character of the string:
In [40]: '\xd0'
Out[40]: 'Ð'
In [41]: b'\xd0'
Out[41]: b'\xd0'
As you can see, a str literal turns \xd0 into the Unicode character with code point 0xd0, while a bytes literal interprets it as a single byte.
UTF-8 uses the following mask for all characters between U+0080 and U+07FF: 110xxxxx for the first byte and 10xxxxxx for the second byte. This is exactly what you get when you encode that string directly to bytes:
In [43]: [bin(x) for x in '\xd0'.encode('utf-8')]
Out[43]: ['0b11000011', '0b10010000']
The actual code point is made of the payload bits 00011 + 010000 (concatenation, not addition), which is 0xd0:
In [44]: hex(int('00011010000', 2))
Out[44]: '0xd0'
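The two-byte mask described above can also be applied by hand. The following is only an illustrative sketch (the helper name encode_two_byte is made up for this example, and it covers only the U+0080 to U+07FF range); in real code you would simply call str.encode:

```python
def encode_two_byte(code_point):
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes by hand."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point outside the two-byte UTF-8 range")
    first = 0b11000000 | (code_point >> 6)         # 110xxxxx: upper 5 bits
    second = 0b10000000 | (code_point & 0b111111)  # 10xxxxxx: lower 6 bits
    return bytes([first, second])

# U+00D0 becomes the same two bytes that '\xd0'.encode('utf-8') produces.
print([bin(b) for b in encode_two_byte(0xD0)])
```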
To get this number from a character, we can use ord:
In [45]: hex(ord('\xd0'))
Out[45]: '0xd0'
Then we just apply it to the whole string and convert the result back to bytes:
In [46]: bytes(map(ord, path)).decode('utf-8')
Out[46]: 'Расшифровка_RootKit.com_63k.txt'
Note that if a character in your string does not fit in a byte (its code point is above 255), the code above will raise an error:
In [47]: bytes([ord(chr(256))])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-49-5555e18dbece> in <module>
----> 1 bytes([ord(chr(256))])
ValueError: bytes must be in range(0, 256)
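If the input may or may not be affected, one option is to wrap the conversion and fall back to the original string when it fails. This is a sketch under that assumption, not part of the original answer; the function name recover_utf8 is hypothetical:

```python
def recover_utf8(text):
    """Reinterpret each character of text as one byte and decode as UTF-8.

    Returns text unchanged if a character does not fit in a byte or if
    the resulting bytes are not valid UTF-8.
    """
    try:
        return bytes(map(ord, text)).decode("utf-8")
    except ValueError:
        # Covers both "bytes must be in range(0, 256)" and
        # UnicodeDecodeError, which is a subclass of ValueError.
        return text

print(recover_utf8("\xd0\xa0\xd0\xb0"))
```

Plain ASCII strings pass through unchanged, since their bytes decode to the same characters.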
b'' is a prefix that causes the following literal to be interpreted as a bytes object. The bytes function, when given a string and an encoding, encodes that string and returns a bytes object.
print(b"\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt".decode('utf8'))
This works, because you are decoding a bytes object.
path = "\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"
bpath = bytes(path, "UTF-8")
print(bpath.decode("UTF-8"))
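This does not work as intended, because path is a str: each \xNN escape in it is a character, not a byte. bytes(path, "UTF-8") encodes those characters into a different sequence of bytes, and decoding that sequence simply gives back the original string.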
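The difference is easier to see on a short prefix of the string. A small sketch, using only the first two escapes from the question (the variable names are illustrative):

```python
s = "\xd0\xa0"    # str: two characters, U+00D0 and U+00A0
b = b"\xd0\xa0"   # bytes: two raw bytes, 0xD0 and 0xA0

encoded = bytes(s, "UTF-8")
print(encoded)                  # four bytes: each character takes two
print(encoded.decode("UTF-8"))  # round trip gives back s unchanged
print(b.decode("UTF-8"))        # the two raw bytes form one Cyrillic letter
```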