I'm trying this code:
s = "سلام"
'{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))
but this error occurs:
'{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd3 in position 0: ordinal not in range(128)
I tried '{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))
but nothing changed.
What should I do?
To allow working with Unicode characters, Python 2 has a unicode type which is a collection of Unicode code points (like Python 3's str type). The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.
Encodings. To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence of code points needs to be represented in memory as a set of code units, and code units are then mapped to 8-bit bytes.
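As a small illustration of the difference between code points and bytes, the sketch below takes the four-character word from the question (written with escapes so it does not depend on the source file's encoding) and serializes it with two encodings. The variable names are only for illustration, and the snippet is written so that it should behave the same on Python 2.7 and Python 3.

```python
# The word from the question, spelled out as Unicode code points.
text = u'\u0633\u0644\u0627\u0645'

# One integer per character, each in the range 0..0x10FFFF.
code_points = [ord(ch) for ch in text]

# The same code points serialized to bytes by two different encodings.
utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16-le')

print([hex(cp) for cp in code_points])
print("%d characters, %d UTF-8 bytes, %d UTF-16 bytes"
      % (len(text), len(utf8_bytes), len(utf16_bytes)))
```

The character count stays at four regardless of encoding; only the byte representation changes.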
If encoding and/or errors are given, unicode() will decode the object which can either be an 8-bit string or a character buffer using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised.
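A rough sketch of that behaviour, using the decode method of byte strings rather than unicode() itself so that it also runs on Python 3 (on Python 2, unicode(raw, 'utf-8') should give the same result as raw.decode('utf-8')). The codec name 'no-such-encoding' is a made-up name chosen to trigger the error.

```python
# UTF-8 bytes for the word used in the question.
raw = b'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'

# Decoding with the right codec gives back the four code points.
text = raw.decode('utf-8')

# An unknown codec name raises LookupError.
lookup_error = None
try:
    raw.decode('no-such-encoding')
except LookupError as exc:
    lookup_error = exc

# The errors argument controls what happens to undecodable bytes:
# 'replace' substitutes U+FFFD instead of raising UnicodeDecodeError.
ascii_view = raw.decode('ascii', 'replace')

print(repr(text))
print(repr(lookup_error))
print(repr(ascii_view))
```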
Since you're using Python 2, s = "سلام" is a byte string (in whatever encoding your terminal uses, presumably UTF-8):
>>> s = "سلام"
>>> s
'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
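You cannot meaningfully encode a byte string, because it is already "encoded": Python 2 first tries to decode it implicitly with the ASCII codec, and that is what raises the UnicodeDecodeError. You're looking for unicode ("real") strings, which in Python 2 must be prefixed with u: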
>>> s = u"سلام"
>>> s
u'\u0633\u0644\u0627\u0645'
>>> '{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))
'1101100010110011110110011000010011011000101001111101100110000101'
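The 'hex' codec used above only exists on Python 2. As a possible portable variant, here is a minimal sketch built on binascii.hexlify, which is available on both Python 2.7 and Python 3. The helper name text_to_bits is made up for this example, and it assumes a non-empty unicode string as input.

```python
import binascii


def text_to_bits(text, encoding='utf-8'):
    """Return the binary digits of the encoded bytes of `text` as a string."""
    raw = text.encode(encoding)                          # unicode -> bytes
    hex_digits = binascii.hexlify(raw).decode('ascii')   # bytes -> hex digits
    return '{:b}'.format(int(hex_digits, 16))            # hex -> binary digits


# The word from the question, written with escapes.
print(text_to_bits(u'\u0633\u0644\u0627\u0645'))
```

Like the original one-liner, this goes through an integer, so any leading zero bits of the first byte are dropped.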
If you're getting a byte string from a function such as raw_input, then your string is already encoded, so just skip the encode('utf-8') part:
'{:b}'.format(int(s.encode('hex'), 16))
or (if you're going to do anything else with it) convert it to unicode:
s = s.decode('utf8')
This assumes that your input is UTF-8 encoded. If that might not be the case, check sys.stdin.encoding first.
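One way to fold that check into the decoding step is sketched below. The helper name to_text and the fallback to UTF-8 are assumptions made for this example, not part of any library. Note that the traceback in the question complains about byte 0xd3, while the UTF-8 form of the string starts with 0xd8, so the terminal there may well have been using a different encoding.

```python
import sys


def to_text(raw, encoding=None):
    """Decode a byte string, defaulting to the stdin encoding, then UTF-8."""
    if encoding is None:
        # sys.stdin.encoding can be None, e.g. when input is piped.
        encoding = getattr(sys.stdin, 'encoding', None) or 'utf-8'
    return raw.decode(encoding)


# Explicit encoding, as in the answer's s.decode('utf8').
word = to_text(b'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85', 'utf-8')
print(repr(word))
```

Calling to_text(raw) without a second argument relies on whatever the interpreter reports for standard input, so the result depends on the environment.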
i18n stuff is complicated; here are two articles that will help you further:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text