
Converting a string to the unicode type in Python

I'm trying this code:

s = "سلام"
'{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))

but this error occurs:

'{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd3 in position 0: ordinal not in range(128)

I tried '{:b}'.format(int(s.encode('utf-8').encode('hex'), 16)) but nothing changed.

What should I do?

asked Oct 08 '13 21:10 by Aidin.T



1 Answer

Since you're using Python 2, s = "سلام" is a byte string (in whatever encoding your terminal uses, presumably UTF-8):

>>> s = "سلام"
>>> s
'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
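The UnicodeDecodeError from the question can be reproduced by spelling out the step Python 2 performs behind the scenes when .encode() is called on a byte string. The following is a minimal sketch, assuming the terminal produced UTF-8 bytes; implicit_encode is a hypothetical helper written for illustration, not a Python API, and the code is meant to run on both Python 2 and Python 3:

```python
# UTF-8 bytes of the word from the question (assumes a UTF-8 terminal).
raw = b'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'


def implicit_encode(byte_string, encoding='utf-8'):
    # Roughly what Python 2 does for byte_string.encode(encoding):
    # decode with the default ASCII codec first, then encode the result.
    return byte_string.decode('ascii').encode(encoding)


try:
    implicit_encode(raw)
    failure = None
except UnicodeDecodeError as exc:
    # The first byte is outside the ASCII range, so decoding fails.
    failure = exc

print(failure)
```

Pure ASCII input survives the round trip, which is why the problem only shows up once non-ASCII text is involved.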

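You cannot encode byte strings, as they are already "encoded". When you call .encode() on one, Python 2 first tries to decode it with the ASCII codec, which is what raises the UnicodeDecodeError. You're looking for unicode ("real") strings, which in Python 2 must be prefixed with u: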

>>> s = u"سلام"
>>> s
u'\u0633\u0644\u0627\u0645'
>>> '{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))
'1101100010110011110110011000010011011000101001111101100110000101'
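The 'hex' codec used above exists only in Python 2. As a hedged alternative, the same conversion can be sketched with binascii.hexlify, which is available in both Python 2 and Python 3; text_to_bits is a hypothetical helper name, and the word is written with escapes so the source file needs no encoding declaration:

```python
import binascii


def text_to_bits(text, encoding='utf-8'):
    """Return the bits of `text` in the given encoding as a '0'/'1' string."""
    raw = text.encode(encoding)         # unicode text -> bytes
    hex_digits = binascii.hexlify(raw)  # bytes -> ASCII hex digits
    # int() parses the hex digits; '{:b}' renders the integer in binary.
    return '{:b}'.format(int(hex_digits, 16))


print(text_to_bits(u'\u0633\u0644\u0627\u0645'))
```

Note that '{:b}' formats a single integer, so leading zero bits of the first byte are dropped; for example, u'A' (0x41) comes out as 7 bits rather than 8.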

If you're getting a byte string from a function such as raw_input, then your string is already encoded, so just skip the encode part:

'{:b}'.format(int(s.encode('hex'), 16))

or (if you're going to do anything else with it) convert it to unicode:

s = s.decode('utf8')

This assumes that your input is UTF-8 encoded. If that might not be the case, check sys.stdin.encoding first.
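One way to fold the sys.stdin.encoding check into the decode step is sketched below. to_unicode is a hypothetical helper, and the fallback to UTF-8 when no encoding is reported is an assumption, not something Python does for you:

```python
import sys


def to_unicode(raw, encoding=None):
    """Decode a byte string, e.g. one returned by raw_input in Python 2."""
    if encoding is None:
        # sys.stdin.encoding can be None (for example when input is piped),
        # so fall back to UTF-8; that fallback is an assumption.
        encoding = getattr(sys.stdin, 'encoding', None) or 'utf-8'
    return raw.decode(encoding)


print(repr(to_unicode(b'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85', 'utf-8')))
```

The traceback in the question reports byte 0xd3 at position 0, while the UTF-8 form of this word starts with 0xd8. That suggests the asker's terminal may have been using a different encoding, such as Windows-1256, in which case the bytes would have to be decoded with that codec instead.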

i18n stuff is complicated. Here are two articles that will help you further:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets

  • What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text

answered Sep 30 '22 11:09 by georg