Python3 has unicode strings (str
) and bytes
. We already have bytestring literals and methods. Why do we need two different types, instead of just byte strings of various encodings?
The answer to your question depends on the meaning of the word "need."
We certainly don't need the str
type in the sense that everything we can compute with the type we can also compute without it (as you know quite well from your well-worded question).
But we can also understand "need" from the point of view of convenience. Isn't it nice to have a sqrt
function? Or log
or exp
or sin
? You could write these yourself, but why bother? A standard library designer will add functions that are useful and convenient.
It is the same for the language itself. Do we "need" a while loop? Not really, we can use tail-recursive functions. Do we "need" list comprehensions? Tons of things in Python are not primitive. For that matter do we "need" high level languages. John von Neumann himself once asked "why would you want more than machine language?"
It is the same with str
and bytes
. The type str
, while not necessary, is a nice, time-saving, convenient thing to have. It gives us an interface as a sequence of characters, so that we can manipulate text character-by-character without:
each_byte
and each_char
.As you suspect, we could have one type which exposes the byte sequence and the character sequence (as Ruby's String
class does). The Python designers wanted to separate those usages into two separate types. You can convert an object of one type into the other very easily. By having two types, they are saying that separation of concerns (and usages) is more important than having fewer built-in types. Ruby makes a different choice.
TL;DR It's a matter of preference in language design: separation of concerns by distinct type rather than by different methods on the same type.
Because bytes should not be considered strings, and strings should not be considered bytes. Python3 gets this right, no matter how jarring this feels to the brand new developer.
In Python 2.6, if I read data from a file, and I passed the "r" flag, the text would be read in the current locale by default, which would be a string, while passing the "rb" flag would create a series of bytes. Indexing the data is entirely different, and methods that take a str
may be unsure of whether I am using bytes or a str. This gets worse since for ASCII data the two are often synonymous, meaning that code which works in simple test cases or English locales will fail upon encountering non-ASCII characters.
There was therefore a conscious effort to ensure bytes and strings were not identical: that one was a sequence of "dumb bytes", and the other was a Unicode string with the optimal encoding for the data to preserve O(1) indexing (ASCII, UCS-2, or UTF-32, depending on the data used, I believe).
In Python 2, the Unicode string was used to disambiguate text from "dumb bytes", however, str
was treated as text by many users.
Or, to quote the Benevolent Dictator:
Python's current string objects are overloaded. They serve to hold both sequences of characters and sequences of bytes. This overloading of purpose leads to confusion and bugs. In future versions of Python, string objects will be used for holding character data. The bytes object will fulfil the role of a byte container. Eventually the unicode type will be renamed to str and the old str type will be removed.
tl;dr version
Forcing the separation of bytes
and str
forces coders to be conscious of their difference, to short-term dissatisfaction, but better code long-term. It's a conscious choice after years of experience: that forcing you to be conscious of the difference immediately will save you days in a debugger later.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With