string.decode() vs. unicode(string)

myString = 'éíěřáé'

I need to decode this string to unicode. Is there any difference between the following two usages, and between these two methods in general?

myString.decode(encoding='UTF-8', errors='ignore')

and

unicode(myString, encoding='UTF-8', errors='ignore')
Meloun asked Aug 08 '12 09:08



2 Answers

The unicode constructor can take other types apart from strings:

>>> unicode(10)
u'10'

For the bytestring case, however, the two forms are mostly equivalent. The exception is codecs that do not produce unicode output, such as 'hex': they are valid for the .decode method of bytestrings but are rejected by the unicode constructor:

>>> unicode('10', encoding='hex')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decoder did not return an unicode object (type=str)
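
The same asymmetry can be sketched in a form that runs on both Python 2.7 and Python 3.4+. This is only an illustration, not part of the original answer: `text_type` is a name introduced here as an alias for the text constructor (`unicode` on Python 2, `str` on Python 3), and `codecs.decode` stands in for the bytestring `.decode` method, since Python 3 bytes objects no longer accept non-text codecs in `.decode`.

```python
import codecs
import sys

# Assumption: an alias so the sketch runs on Python 2.7 and Python 3.4+.
# On Python 3 the unicode type is called str.
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821

hex_bytes = b'10'

# A bytes-to-bytes codec such as 'hex' works through the generic
# decode machinery and returns a bytestring.
raw = codecs.decode(hex_bytes, 'hex')

# The text constructor refuses the same codec because the result is
# not text: TypeError on Python 2, LookupError on Python 3.
try:
    text_type(hex_bytes, 'hex')
    rejected = False
except (TypeError, LookupError):
    rejected = True

print(repr(raw), rejected)
```

The exception type differs between versions, so the sketch catches both rather than relying on one message.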
Martijn Pieters answered Oct 04 '22 04:10


They're essentially the same, with minor performance shortcuts on each side: str.decode knows that its argument is a string, so it can skip type checking of that argument, while unicode.__new__ has fast paths for some common encodings, including UTF-8.

Both methods call into PyCodec_Decode in the general case.
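
A quick way to check the equivalence for the question's input is to run both forms on the same bytes and compare the results. The sketch below is an illustration under stated assumptions: `text_type` is a name introduced here so the code runs on Python 2.7 and Python 3 alike, and the sample string is written out as its UTF-8 bytes so the literal does not depend on the source file encoding.

```python
import sys

# Assumption: an alias for the text constructor (unicode on Python 2,
# str on Python 3) so the sketch runs on both.
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821

# UTF-8 bytes of the question's sample string 'éíěřáé'.
my_string = b'\xc3\xa9\xc3\xad\xc4\x9b\xc5\x99\xc3\xa1\xc3\xa9'

via_method = my_string.decode(encoding='UTF-8', errors='ignore')
via_constructor = text_type(my_string, encoding='UTF-8', errors='ignore')

# A stray 0xff byte is invalid UTF-8; errors='ignore' drops it in
# both forms, so the results still match.
broken = my_string + b'\xff'
broken_method = broken.decode(encoding='UTF-8', errors='ignore')
broken_constructor = text_type(broken, encoding='UTF-8', errors='ignore')

print(via_method == via_constructor)         # True
print(broken_method == broken_constructor)   # True
```

This only compares results; it says nothing about the relative speed of the two call paths.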

ecatmur answered Oct 04 '22 04:10