
Distinction between str and unicode: why does Redis return binary data when passed unicode?

After two questions regarding the distinction between the datatypes str and unicode, I'm still puzzled at the following.

In Block 1 we see that the type of the city is unicode, as we're expecting.

Yet in Block 2, after a round-trip through disk (redis), the type of the city is str (and the representation is different).

The dogma of storing utf-8 on disk, reading into unicode, and writing back in utf-8 is failing somewhere.

Why is the second instance of type(city) str rather than unicode?

Just as importantly, does it matter? Do you care whether your variables are unicode or str, or are you oblivious to the difference just so long as the code "does the right thing"?

# -*- coding: utf-8 -*-

# Block 1
city = u'Düsseldorf'
print city, type(city), repr(city)
# Düsseldorf <type 'unicode'> u'D\xfcsseldorf'

# Block 2
import redis
r_server = redis.Redis('localhost')
r_server.set('city', city)
city = r_server.get('city')
print city, type(city), repr(city)
# Düsseldorf <type 'str'> 'D\xc3\xbcsseldorf'
asked Mar 01 '16 15:03 by Calaf




1 Answer

Dogma?

Character sets and encodings aren't dogma; they're a necessity. Hopefully, you will have read enough to understand why we have so many character sets in use. Unicode is obviously the way forward (having all characters mapped), but how do you transfer a Unicode character from one machine to another, or save it to disk?

We could use the Unicode code point value directly, but code points run up to U+10FFFF, so a fixed-width encoding needs 32 bits per character (this is UTF-32). The letter a would be encoded as 0x00000061, which wastes three bytes on a single character. UTF-16 is a little less wasteful when dealing with mostly ASCII text, but UTF-8 is the best compromise for such text: it uses one byte per ASCII character and up to four bytes for anything else.
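To make the size comparison concrete, here is a minimal sketch that measures the question's city name in each encoding. It is written to run on both Python 2.7 and Python 3, and it uses the `\xfc` escape rather than a literal ü so that it doesn't depend on the source file encoding. The variable names are only for illustration.

```python
# Byte cost of the same text in three Unicode encodings.
city = u'D\xfcsseldorf'  # 10 characters, one of them non-ASCII

# The "-le" variants fix the byte order, so no byte order mark is prepended.
sizes = dict((name, len(city.encode(name)))
             for name in ('utf-8', 'utf-16-le', 'utf-32-le'))
# sizes == {'utf-8': 11, 'utf-16-le': 20, 'utf-32-le': 40}

# A single ASCII letter: four bytes in UTF-32, one byte in UTF-8.
a_utf32 = u'a'.encode('utf-32-be')  # b'\x00\x00\x00a'
a_utf8 = u'a'.encode('utf-8')       # b'a'
```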

Using decoded Unicode within code obviously frees developers from having to consider the intricacies of encoding, such as how many bytes equal a character.
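As a small illustration of the "how many bytes equal a character" problem, the sketch below compares the character count with the UTF-8 byte count, and shows that slicing the encoded bytes can cut a character in half. It runs on Python 2.7 and 3; the variable names are made up for the example.

```python
city = u'D\xfcsseldorf'
raw = city.encode('utf-8')

char_count = len(city)  # 10 characters
byte_count = len(raw)   # 11 bytes, because the u-umlaut takes two bytes

# Slicing the decoded text is safe: you get whole characters.
first_two_chars = city[:2]  # u'D\xfc'

# Slicing the encoded bytes is not: the first two bytes end in the
# middle of the two-byte sequence for the u-umlaut.
try:
    raw[:2].decode('utf-8')
    byte_slice_decodes = True
except UnicodeDecodeError:
    byte_slice_decodes = False
```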

Solutions

Redis Client

As suggested by @J.F.Sebastian, the redis-py driver includes a decode_responses option on the Redis and Connection classes. When set to True the client will decode the responses using the encoding option. By default encoding = utf-8.

E.g.

r_server = redis.Redis('localhost', decode_responses=True)
city = r_server.get('city')
# type(city) is now <type 'unicode'>

Wrapper Class

This is no longer required now that decode_responses is available; it is kept here for reference.

It would appear that the Redis driver is rather simplistic: if you send it a unicode object, it encodes it using the default encoding (UTF-8 in most cases). Redis itself stores only bytes and keeps no record of their encoding, so on response the driver returns a str for you to decode as appropriate.

Therefore, it would be safer to encode your strings to UTF-8 explicitly before sending them to Redis, and to decode them as UTF-8 on response. Other DB drivers are more advanced, so they both accept and return unicode objects.
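The explicit round trip looks roughly like this. To keep the sketch runnable without a server, a plain dict stands in for Redis (`fake_server` is hypothetical, not part of redis-py); what matters is that only bytes go in and only bytes come out. It runs on Python 2.7 and 3.

```python
# Hypothetical stand-in for a Redis server: it holds bytes and nothing else.
fake_server = {}

city = u'D\xfcsseldorf'

# Encode explicitly before sending ...
fake_server['city'] = city.encode('utf-8')

# ... bytes come back (a str in Python 2), matching the repr in the question ...
reply = fake_server['city']  # b'D\xc3\xbcsseldorf'

# ... and are decoded explicitly on response.
restored = reply.decode('utf-8')  # u'D\xfcsseldorf'
```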

But of course, you shouldn't be peppering your code with .encode() and .decode(). The common approach is to form "Unicode sandwiches", so that external data is decoded to Unicode on input and encoded on output. So how does that work for you? Wrap the Redis driver so that it returns what you want, thereby pushing the decoding back into the periphery of your code.

For example, it should be as simple as:

import redis


class UnicodeRedis(redis.Redis):
    """Redis client whose get() returns unicode instead of str (Python 2)."""

    def __init__(self, *args, **kwargs):
        # Remember the encoding for get(); it is still passed on to redis.Redis.
        self.encoding = kwargs.get("encoding", "utf-8")
        super(UnicodeRedis, self).__init__(*args, **kwargs)

    def get(self, *args, **kwargs):
        result = super(UnicodeRedis, self).get(*args, **kwargs)
        # Decode byte strings; pass None (missing key) through unchanged.
        if isinstance(result, str):
            return result.decode(self.encoding)
        return result

You can then interact with it as normal except that you can pass an encoding argument that changes how strings are decoded. If you don't set encoding, this code will assume utf-8.

E.g.

r_server = UnicodeRedis('localhost')
city = r_server.get('city')
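
If you want to try the decode-at-the-edge idea without a running server, here is a rough sketch of the same pattern against an in-memory stand-in. Both `FakeByteStore` and `UnicodeStore` are hypothetical names invented for this example and are not part of redis-py; unlike the wrapper above, this one also encodes in set() so that both edges of the "sandwich" are visible. It runs on Python 2.7 and 3.

```python
class FakeByteStore(object):
    """Hypothetical stand-in for a Redis client: stores and returns bytes."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)  # None for a missing key


class UnicodeStore(FakeByteStore):
    """Encodes text on the way in and decodes bytes on the way out."""

    encoding = 'utf-8'

    def set(self, key, value):
        super(UnicodeStore, self).set(key, value.encode(self.encoding))

    def get(self, key):
        result = super(UnicodeStore, self).get(key)
        if isinstance(result, bytes):
            return result.decode(self.encoding)
        return result


store = UnicodeStore()
store.set('city', u'D\xfcsseldorf')

# Everything between the two edges works on decoded text only.
city = store.get('city')   # u'D\xfcsseldorf'
shouted = city.upper()     # u'D\xdcSSELDORF'
missing = store.get('no-such-key')  # None
```

The calling code never touches .encode() or .decode(); the bytes exist only inside the store.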

answered Oct 04 '22 15:10 by Alastair McCormack