I believe most of you who are familiar with Python have read Dive Into Python 3. In chapter 4.3, it says this:
In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question.
I more or less understand what this means: strings are characters from the Unicode set, and Python can encode those characters using different encodings. However, aren't characters in Python stored as bytes in the computer anyway? For example, given s = 'strings', s is surely stored in my computer as a byte stream '0100100101...' or whatever. So which encoding is used there? Is it Python's "default" encoding?
Thanks!
CPython stores strings as sequences of Unicode code points. Since CPython 3.3 (PEP 393), each code point in a string takes 1, 2, or 4 bytes, depending on the largest code point that string contains. All characters in a given string must have the same width, so the string's size in bytes grows in proportion to the width required by its largest character.
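A rough way to observe this on your own machine is to compare sys.getsizeof() for strings of different lengths. This is only a sketch: it assumes CPython 3.3 or later, the helper name bytes_per_char is made up for illustration, and other Python implementations may report different numbers.

```python
import sys

def bytes_per_char(ch, n=100):
    """Estimate CPython's per-character storage for a string made only of `ch`.

    Hypothetical helper: subtracting the sizes of two strings of different
    lengths cancels out the fixed object header.
    """
    short = sys.getsizeof(ch * n)
    longer = sys.getsizeof(ch * (2 * n))
    return (longer - short) // n

# 'a' (U+0061), 'e' with acute (U+00E9), euro sign (U+20AC), an emoji (U+1F600)
for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}: {bytes_per_char(ch)} byte(s) per character")

# One wide character forces the whole string into the wider layout.
narrow = sys.getsizeof("a" * 101)
wide = sys.getsizeof("a" * 100 + "\U0001F600")
print(narrow, wide)
```

Both strings in the last comparison have 101 characters, but the one containing the emoji is considerably larger because every character in it is stored at the wider width.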
String literals inside triple quotes, """ or ''', can span multiple lines of text. Python strings are "immutable" which means they cannot be changed after they are created (Java strings also use this immutable style). Since strings can't be changed, we construct *new* strings as we go to represent computed values.
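A small sketch of what immutability and triple quotes look like in practice (the variable names are purely illustrative):

```python
s = "hello"

# In-place modification is not allowed on str objects.
try:
    s[0] = "J"
except TypeError as exc:
    error = str(exc)
    print("TypeError:", error)

# Instead, build a new string from pieces of the old one.
t = "J" + s[1:]
print(s, t)  # the original string is unchanged

# A triple-quoted literal can span several lines.
poem = """first line
second line"""
print(poem.splitlines())
```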
How do you create a string in Python? Strings can be created by enclosing characters in single quotes or double quotes. Triple quotes can also be used, but they are generally reserved for multiline strings and docstrings.
Strings. Strings are sequences of character data. The string type in Python is called str. String literals may be delimited using either single or double quotes.
Python 3 distinguishes between text and binary data. Text is guaranteed to be in Unicode, though no specific encoding is specified, as far as I could see. So it could be UTF-8, or UTF-16, or UTF-32¹ – but you wouldn't even notice.
The main point here is: you shouldn't even care. If you want to deal with text, use text strings and access them by code point (the number of a single Unicode character, which is independent of the internal UTF, which may organise code points into several smaller code units). If you want bytes, use b"" and access them by byte. And if you want a string as a byte sequence in a specific encoding, use .encode().
¹ Or even UTF-9, if someone is insane enough to implement Python on a PDP-10.
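To make the text/bytes distinction concrete, here is a minimal sketch. The sample string is arbitrary, and nothing here depends on how the interpreter stores str internally.

```python
text = "na\u00efve \u20ac"  # "naive" with a diaeresis on the i, a space, a euro sign

# Text is indexed by code point, whatever the internal representation is.
print(len(text))        # number of code points
print(repr(text[2]))    # the third character

# Encoding produces bytes; the length depends on the encoding you choose.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
print(len(utf8), len(utf16))

# Bytes are indexed by byte and yield integers.
print(utf8[2])

# Decoding with the same encoding gives the original text back.
assert utf8.decode("utf-8") == text
assert utf16.decode("utf-16-le") == text
```

The same seven-character string becomes byte sequences of different lengths under UTF-8 and UTF-16, which is why "which encoding is this string?" only makes sense once you have bytes.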