
Will a UNICODE string just containing ASCII characters always be equal to the ASCII string?

I noticed the following holds:

>>> u'abc' == 'abc'
True
>>> 'abc' == u'abc'
True

Will this always be true, or could it depend on the system locale? (It seems strings are Unicode in Python 3, e.g. this question, but bytes in 2.x.)

doctorlove asked Feb 20 '15 11:02




1 Answer

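When comparing a str with a unicode object, Python 2 implicitly decodes the str to unicode using the default encoding, which is ASCII. So yes, for strings containing only ASCII characters this is always true, and it does not depend on the system locale.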
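A minimal sketch of what that implicit coercion amounts to, written so that it runs on both Python 2 and Python 3. The helper name compare_like_python2 is hypothetical, and the sketch assumes the default encoding is still ASCII; the real comparison happens inside the interpreter, not in a function like this.

```python
def compare_like_python2(byte_string, text):
    """Approximate Python 2's str == unicode comparison (hypothetical helper).

    Assumes the default encoding is ASCII, as on an unmodified Python 2.
    """
    try:
        # Python 2 decodes the byte string with the default codec first.
        return byte_string.decode('ascii') == text
    except UnicodeDecodeError:
        # Python 2 treats the two values as unequal here
        # (and also emits a UnicodeWarning, which this sketch omits).
        return False


print(compare_like_python2(b'abc', u'abc'))              # True
print(compare_like_python2(b'caf\xc3\xa9', u'caf\xe9'))  # False
```

The second call shows the flip side: as soon as the byte string contains non-ASCII bytes, the ASCII decode fails and the comparison is simply False, even though the bytes are valid UTF-8 for the same text.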

That is, unless you mess up your Python installation and use sys.setdefaultencoding() to change that default. Normally you cannot, because the sys.setdefaultencoding() function is deleted from the sys module at start-up, but there is a cargo cult going around where people use reload(sys) to reinstate the function and switch the default encoding to something else in an attempt to fix implicit encoding and decoding problems. This is a dumb thing to do, for precisely this reason: it silently changes the codec used for implicit conversions such as this comparison.
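To check whether an interpreter has been tampered with in this way, one option is to inspect the sys module directly. This is only a sketch, and the function name implicit_codec_report is made up for illustration. On an unmodified Python 2 started normally, the default encoding is expected to be 'ascii' and setdefaultencoding should be gone; Python 3 reports 'utf-8' and has no setdefaultencoding() at all.

```python
import sys


def implicit_codec_report():
    """Return (default encoding, whether sys.setdefaultencoding is reachable)."""
    return sys.getdefaultencoding(), hasattr(sys, 'setdefaultencoding')


encoding, can_change = implicit_codec_report()
print("default encoding: %s" % encoding)
print("setdefaultencoding available: %s" % can_change)
```

If the second line prints True on Python 2, some code has probably called reload(sys), and implicit str/unicode conversions may no longer use ASCII.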

Martijn Pieters answered Sep 26 '22 12:09