After two earlier questions about the distinction between the datatypes str and unicode, I'm still puzzled by the following. In Block 1 we see that the type of the city is unicode, as we're expecting. Yet in Block 2, after a round trip through disk (Redis), the type of the city is str (and the representation is different). The dogma of storing UTF-8 on disk, reading into unicode, and writing back as UTF-8 is failing somewhere.
Why is the second instance of type(city) str rather than unicode?
Just as importantly, does it matter? Do you care whether your variables are unicode or str, or are you oblivious to the difference just so long as the code "does the right thing"?
# -*- coding: utf-8 -*-
# Block 1
city = u'Düsseldorf'
print city, type(city), repr(city)
# Düsseldorf <type 'unicode'> u'D\xfcsseldorf'
# Block 2
import redis
r_server = redis.Redis('localhost')
r_server.set('city', city)
city = r_server.get('city')
print city, type(city), repr(city)
# Düsseldorf <type 'str'> 'D\xc3\xbcsseldorf'
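To see what happened, here is a small sketch that needs no Redis at all. It only assumes that the str returned in Block 2 is the UTF-8 encoding of the unicode from Block 1, so decoding it recovers the original value:

```python
# -*- coding: utf-8 -*-
# Sketch: the str from Block 2 is just the UTF-8 encoding of the
# unicode from Block 1, so decoding it recovers the original text.
city_u = u'D\xfcsseldorf'            # the value from Block 1
city_b = city_u.encode('utf-8')      # the bytes that travel to/from Redis
print(repr(city_b))                  # the same byte sequence Block 2 printed
print(city_b.decode('utf-8') == city_u)  # True
```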
Redis strings store sequences of bytes, including text, serialized objects, and binary arrays. As such, strings are the most basic Redis data type. They're often used for caching, but they support additional functionality that lets you implement counters and perform bitwise operations, too.
In Python 2 there are two types that represent sequences of characters: str and unicode. Instances of str contain raw 8-bit values; instances of unicode contain Unicode characters. (In Python 3, by contrast, str contains Unicode characters and bytes holds the raw 8-bit values.) There are many ways to represent Unicode characters as binary data - these are the encodings.
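To make the distinction concrete, here is a small sketch (runs under Python 2.7 or 3) showing that the same text has a different length measured in characters than measured in UTF-8 bytes:

```python
# -*- coding: utf-8 -*-
# The same text, as Unicode characters vs. as UTF-8 bytes.
text = u'D\xfcsseldorf'        # 10 Unicode characters
data = text.encode('utf-8')    # its UTF-8 byte representation
print(len(text))               # 10 (code points)
print(len(data))               # 11 (the 'ü' takes two bytes in UTF-8)
```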
Character sets and encodings aren't used out of dogma - they're a necessity. Hopefully you've read enough to understand why we have so many character sets in use. Unicode is obviously the way forward (having all characters mapped), but how do you transfer a Unicode character from one machine to another, or save it to disk?
We could use the code point value directly, but since Unicode code points can require up to 21 bits, each character would need to be saved/transferred as a full 32 bits (aka UTF-32): a would be encoded as 0x00000061 - that's a lot of wasted bits just for one character. UTF-16 is a little less wasteful when dealing with mostly-ASCII text, but UTF-8 is the best compromise, using the least amount of bytes for the common cases.
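A quick way to see the trade-off (the codec names below are the standard ones Python's codecs accept):

```python
# -*- coding: utf-8 -*-
# Compare how many bytes each encoding needs for the same short text.
text = u'a\xfc\u20ac'  # 'a' (ASCII), 'ü' (Latin-1 range), '€' (BMP)
for codec in ('utf-32-be', 'utf-16-be', 'utf-8'):
    print(codec, len(text.encode(codec)))
# utf-32-be always uses 4 bytes per character; utf-8 uses 1-3 here.
print(repr(u'a'.encode('utf-32-be')))  # the 0x00000061 mentioned above
```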
Using decoded Unicode within your code frees you from having to consider the intricacies of encoding, such as how many bytes make up a character.
As suggested by @J.F.Sebastian, the redis-py driver includes a decode_responses option on the Redis and Connection classes. When set to True, the client decodes responses using the encoding option, which defaults to utf-8.
E.g.
r_server = redis.Redis('localhost', decode_responses=True)
city = r_server.get('city')
# type(city) is now <type 'unicode'>
The following is no longer required now that decode_responses has been discovered, but it is left here for reference.
It would appear that the Redis driver is rather simplistic: if you send it a unicode, it converts it to the default encoding (UTF-8 in most cases). On response, Redis doesn't know the encoding, so it returns an str for you to decode as appropriate.
Therefore, it would be safer to encode your strings to UTF-8 before sending them to Redis and decode them as UTF-8 on response. Other DB drivers are more advanced, and accept and return unicode directly.
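As a sketch of that encode-before/decode-after discipline (a plain dict stands in for the Redis connection here, and put/get are made-up helper names, not redis-py API):

```python
# -*- coding: utf-8 -*-
store = {}  # stand-in for Redis: it only ever holds bytes

def put(key, text):
    # Encode on the way out to the byte-oriented store.
    store[key] = text.encode('utf-8')

def get(key):
    # Decode on the way back in, so application code only sees text.
    return store[key].decode('utf-8')

put('city', u'D\xfcsseldorf')
print(get('city') == u'D\xfcsseldorf')  # True
```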
But of course, you shouldn't be peppering your code with .encode() and .decode() calls. The common approach is to form a "Unicode sandwich", so that external data is decoded to Unicode on input and encoded on output. So how does that work for you? Wrap the Redis driver so that it returns what you want, thereby pushing the decoding back into the periphery of your code.
For example, it should be as simple as:
class UnicodeRedis(redis.Redis):
    def __init__(self, *args, **kwargs):
        # Remember the encoding so get() can decode responses with it.
        if "encoding" in kwargs:
            self.encoding = kwargs["encoding"]
        else:
            self.encoding = "utf-8"
        super(UnicodeRedis, self).__init__(*args, **kwargs)

    def get(self, *args, **kwargs):
        result = super(UnicodeRedis, self).get(*args, **kwargs)
        if isinstance(result, str):
            # Redis returned raw bytes (str in Python 2): decode them.
            return result.decode(self.encoding)
        else:
            return result
You can then interact with it as normal, except that you can pass an encoding argument that changes how strings are decoded. If you don't set encoding, this code assumes utf-8.
E.g.
r_server = UnicodeRedis('localhost')
city = r_server.get('city')