This is a section from Dive Into Python 3 regarding strings:
In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in utf-8, or a Python string encoded as CP-1252. “Is this string utf-8?” is an invalid question. utf-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.
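To make the distinction concrete, here is a minimal sketch of encoding a string into bytes and decoding it back. The sample word and the variable names are only illustrative; they do not come from the book.

```python
# A str is a sequence of Unicode characters; encoding turns it into bytes.
text = "caf\u00e9"  # "café", written with an escape to keep the source ASCII

utf8_bytes = text.encode("utf-8")      # the accented e becomes two bytes
cp1252_bytes = text.encode("cp1252")   # the accented e becomes one byte

print(len(text))          # 4 characters
print(len(utf8_bytes))    # 5 bytes
print(len(cp1252_bytes))  # 4 bytes

# Decoding goes the other way and needs the matching encoding.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip == text)  # True
```

The same four characters produce different byte sequences depending on the encoding, which is why asking whether the string itself "is utf-8" has no answer.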
Earlier today I used the hashlib module and read the help text for md5 that says:
Return a new MD5 hash object; optionally initialized with a string.
Well, it doesn't accept a string - it accepts a bytes object.
Maybe I'm reading too much into this, but wouldn't it make more sense if the help text stated that a bytes object should be used instead? Or are people using the same name for strings and bytes?
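The behaviour described in the question can be reproduced with a short sketch like the one below. It assumes Python 3; the message text and variable names are made up for illustration.

```python
import hashlib

message = "hello"

# Passing a str raises TypeError in Python 3: the hash operates on bytes.
try:
    hashlib.md5(message)
    str_accepted = True
except TypeError:
    str_accepted = False

# Encoding first makes the choice of byte representation explicit.
digest = hashlib.md5(message.encode("utf-8")).hexdigest()

print(str_accepted)  # False
print(digest)
```

Because the digest depends on the bytes, the same text encoded with a different encoding can hash to a different value.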
In Python, a byte string is just that: a sequence of bytes. It isn't human-readable. Under the hood, everything must be converted to a byte string before it can be stored in a computer. On the other hand, a character string, often just called a "string", is a sequence of characters. It is human-readable.
From "Strings and Character Data in Python": The bytes object is one of the core built-in types for manipulating binary data. A bytes object is an immutable sequence of single byte values. Each element in a bytes object is a small integer in the range of 0 to 255.
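A small sketch of those three properties (integer elements, immutability, and the 0 to 255 range) might look like this. The byte values are arbitrary examples.

```python
data = bytes([72, 105, 255])

first = data[0]      # indexing yields an int, not a one-byte bytes object
print(first)         # 72
print(list(data))    # [72, 105, 255]

# bytes is immutable: item assignment raises TypeError.
try:
    data[0] = 0
    mutable = True
except TypeError:
    mutable = False

# Values outside 0..255 are rejected with ValueError.
try:
    bytes([256])
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False

print(mutable, out_of_range_ok)  # False False
```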
Note that every string in Python takes an additional 49-80 bytes of memory, where it stores supplementary information such as the hash, length, length in bytes, encoding type and string flags. That's why an empty string takes 49 bytes of memory. These figures apply to 64-bit CPython and vary between versions.
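One way to observe this overhead is sys.getsizeof, as in the rough sketch below. It assumes CPython, and the exact numbers depend on the interpreter version and platform, so none are shown here.

```python
import sys

# Sizes include the per-object overhead, not just the character data.
empty_size = sys.getsizeof("")
ascii_size = sys.getsizeof("abc")
wide_size = sys.getsizeof("abc\u20ac")  # the euro sign forces a wider storage

print(empty_size, ascii_size, wide_size)
```

Even the empty string has a non-zero size, and adding one character outside Latin-1 grows the string by more than one byte, because CPython picks a wider internal representation for the whole string.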
That layout is specific to Python. In the .NET runtime (not Python), by comparison, a string object begins with an 8-byte object header (a 4-byte SyncBlock and a 4-byte type descriptor).
In Python 2, str was used for both strings of characters and bytes. In fact, until Python 2.6, there wasn't even a bytes type (and in 2.6 and 2.7, bytes is str).
The mentioned inconsistencies in the hashlib documentation are an artifact of this history.