This is a Python 101 type question, but it had me baffled for a while when I tried to use a package that seemed to convert my string input into bytes.
As you will see below, I found the answer for myself, but I felt it was worth recording here because of the time it took me to unearth what was going on. It seems to be generic to Python 3, so I have not referred to the original package I was playing with; it does not seem to be an error (just that the particular package had a .tostring() method that was clearly not producing what I understood as a string...)
My test program goes like this:
    import mangler  # spoof package

    stringThing = """
    <Doc>
     <Greeting>Hello World</Greeting>
     <Greeting>你好</Greeting>
    </Doc>
    """

    # print out the input
    print('This is the string input:')
    print(stringThing)

    # now make the string into bytes
    bytesThing = mangler.tostring(stringThing)  # pseudo-code again

    # now print it out
    print('\nThis is the bytes output:')
    print(bytesThing)
The output from this code gives this:
    This is the string input:

    <Doc>
     <Greeting>Hello World</Greeting>
     <Greeting>你好</Greeting>
    </Doc>


    This is the bytes output:
    b'\n<Doc>\n <Greeting>Hello World</Greeting>\n <Greeting>\xe4\xbd\xa0\xe5\xa5\xbd</Greeting>\n</Doc>\n'
So, there is a need to be able to convert between bytes and strings, to avoid non-ASCII characters ending up as escape sequences that read like gobbledegook.
Decoding is the process of converting a bytes object back into a string. It is done with the decode() method. A byte string can be decoded back into a character string if you know which encoding was used to encode it.
The b prefix means bytes, not binary. \x00 is not the string "0" but the byte with value 0, which cannot be displayed, so Python shows its escape code. – furas
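A small sketch of the point made in that comment, using only built-in types (the variable name `data` is just for illustration):

```python
# A bytes object is a sequence of integers in the range 0-255.
data = b'A\x00\xe4'

# Printable ASCII bytes are shown as characters; anything else as a \x escape.
print(data)        # b'A\x00\xe4'

# Indexing returns the integer value of a byte, not a one-character string.
print(data[0])     # 65
print(data[1])     # 0

# The character '0' is a different byte altogether (value 48).
print(b'0'[0])     # 48
print(len(data))   # 3
```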
The bytes decode() method converts bytes to a str object, and str.encode() does the reverse. Both methods let you specify the error handling scheme to use for encoding/decoding errors. The default is 'strict', meaning that encoding errors raise a UnicodeEncodeError and decoding errors raise a UnicodeDecodeError.
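To see the error handling schemes in action, here is a minimal sketch. It deliberately uses the 'ascii' codec so that the Chinese characters cannot be represented; the variable names are my own.

```python
text = 'Greeting: 你好'

# 'strict' (the default) raises when a character cannot be encoded.
try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    print('encode failed:', exc.reason)

# Other handlers substitute or drop the offending characters instead.
replaced = text.encode('ascii', errors='replace')
ignored = text.encode('ascii', errors='ignore')
print(replaced)   # b'Greeting: ??'
print(ignored)    # b'Greeting: '

# Decoding offers the same choice; under 'strict', invalid bytes raise
# UnicodeDecodeError.
raw = '你好'.encode('utf-8')
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    print('decode failed:', exc.reason)

# With 'replace', each undecodable byte becomes U+FFFD (the replacement character).
lenient = raw.decode('ascii', errors='replace')
print(lenient)
```

Which handler is appropriate depends on whether silently losing data is acceptable; 'strict' is usually the safer default because it makes the mismatch visible.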
The 'mangler' in the above code sample was doing the equivalent of this:
bytesThing = stringThing.encode(encoding='UTF-8')
There are other ways to write this (notably using bytes(stringThing, encoding='UTF-8')), but the above syntax makes it obvious what is going on, and also what to do to recover the string:
newStringThing = bytesThing.decode(encoding='UTF-8')
When we do this, the original string is recovered.
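Here is a runnable sketch of the whole round trip without the spoof package. It assumes that str.encode() really is all the package was doing, which is my reading rather than something I confirmed in its source.

```python
string_thing = """
<Doc>
 <Greeting>Hello World</Greeting>
 <Greeting>你好</Greeting>
</Doc>
"""

# str -> bytes
bytes_thing = string_thing.encode(encoding='UTF-8')

# bytes -> str, using the same encoding
new_string_thing = bytes_thing.decode(encoding='UTF-8')

print(type(bytes_thing).__name__)          # bytes
print(new_string_thing == string_thing)    # True

# The two Chinese characters take three bytes each in UTF-8,
# so the bytes object is longer than the string.
print(len(string_thing), len(bytes_thing))
```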
Note, using str(bytesThing) just transcribes all the gobbledegook without converting it back into Unicode, unless you specifically request UTF-8, viz., str(bytesThing, encoding='UTF-8'). No error is reported if the encoding is not specified.
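The difference between the two str() calls can be checked with a short sketch (the variable names are illustrative):

```python
bytes_thing = '你好'.encode('UTF-8')

# Without an encoding, str() returns the printable representation of the
# bytes object, including the b prefix and the quotes.
as_repr = str(bytes_thing)
print(as_repr)   # b'\xe4\xbd\xa0\xe5\xa5\xbd'

# With an encoding, str() decodes the bytes, just like bytes.decode().
as_text = str(bytes_thing, encoding='UTF-8')
print(as_text)   # 你好
```

Running Python with the -b command-line option makes the first call emit a BytesWarning, which can help track down accidental uses of it.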
In Python 3, there is a built-in bytes() constructor that takes the same encoding argument as encode().
    str1 = b'hello world'
    str2 = bytes("hello world", encoding="UTF-8")
    print(str1 == str2)  # prints True
I didn't read anything about this in the docs, but perhaps I wasn't looking in the right place. This way you can explicitly turn strings into byte streams, which I find more readable than using encode and decode, and without having to prefix the quotes with b.