Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Attempting to replicate Python's string interning functionality for non-strings

For a self-project, I wanted to do something like:

class Species(object): # immutable.
    def __init__(self, id):
        # ... (using id to obtain height and other data from file)
    def height(self):
        # ...

class Animal(object): # mutable.

    def __init__(self, nickname, species_id):
        self.nickname = nickname
        self.species = Species(id)
    def height(self):
        return self.species.height()

As you can see, I don't really need more than one instance of Species(id) per id, but I'd be creating one every time I'm creating an Animal object with that id, and I'd probably need multiple calls of, say, Animal(somename, 3).

To solve that, what I'm trying to do is to make a class so that for 2 instances of it, let's say a and b, the following is always true:

(a == b) == (a is b)

This is something that Python does with string literals and is called internship. Example:

a = "hello"
b = "hello"
print(a is b)

that print will yield true (as long as the string is short enough if we're using the python shell directly).

I can only guess how CPython does this (it probably involves some C magic) so I'm doing my own version of it. So far I've got:

class MyClass(object):

    myHash = {} # This replicates the intern pool.

    def __new__(cls, n): # The default new method returns a new instance
        if n in MyClass.myHash:
            return MyClass.myHash[n]

        self = super(MyClass, cls).__new__(cls)
        self.__init(n)
        MyClass.myHash[n] = self

        return self

    # as pointed out on an answer, it's better to avoid initializating the instance 
    # with __init__, as that one's called even when returning an old instance.
    def __init(self, n): 
        self.n = n

a = MyClass(2)
b = MyClass(2)

print a is b # <<< True

My questions are:

a) Is my problem even worth solving? Since my intended Species object should be quite light weight and the max amount of times Animal can be called, rather limited (imagine a Pokemon game: no more than 1000 instances, tops)

b) If it is, is this a valid approach to solve my problem?

c) If it's not valid, could you please elaborate on a simpler / cleaner / more Pythonic way to solve this?

like image 810
Dleep Avatar asked Oct 23 '15 00:10

Dleep


1 Answers

To make this as general as possible, I'm going to recommend a couple things. One, inherit from a namedtuple if you want "true" immutability (normally people are rather hands off about this, but when you're doing interning, breaking the immutable invariant can cause much bigger problems). Second, use locks to allow thread safe behavior.

Because this is rather complex, I'm going to provide a modified copy of Species code with comments explaining it:

import collections
import operator
import threading

# Inheriting from a namedtuple is a convenient way to get immutability
class Species(collections.namedtuple('SpeciesBase', 'species_id height ...')):
    __slots__ = ()  # Prevent creation of arbitrary values on instances; true immutability of declared values from namedtuple makes true immutable instances

    # Lock and cache, with underscore prefixes to indicate they're internal details
    _cache_lock = threading.Lock()
    _cache = {} 

    def __new__(cls, species_id):  # Switching to canonical name cls for class type
        # Do quick fail fast check that ID is in fact an int/long
        # If it's int-like, this will force conversion to true int/long
        # and minimize risk of incompatible hash/equality checks in dict
        # lookup
        # I suspect that in CPython, this would actually remove the need
        # for the _cache_lock due to the GIL protecting you at the
        # critical stages (because no byte code is executing comparing
        # or hashing built-in int/long types), but the lock is a good idea
        # for correctness (avoiding reliance on implementation details)
        # and should cost little
        species_id = operator.index(species_id)

        # Lock when checking/mutating cache to make it thread safe
        try:
            with cls._cache_lock:
                return cls._cache[species_id]
        except KeyError:
            pass

        # Read in data here; not done under lock on assumption this might
        # be expensive and other Species (that already exist) might be
        # created/retrieved from cache during this time
        species_id = ...
        height = ...
        # Pass all the values read to the superclass (the namedtuple base)
        # constructor (which will set them and leave them immutable thereafter)
        self = super(Species, cls).__new__(cls, species_id, height, ...)

        with cls._cache_lock:
            # If someone tried to create the same species and raced
            # ahead of us, use their version, not ours to ensure uniqueness
            # If no one raced us, this will put our new object in the cache
            self = cls._cache.setdefault(species_id, self)
        return self

If you want to do interning for general libraries (where users might be threaded, and you can't trust them not to break the immutability invariant), something like the above is a basic structure to work with. It's fast, minimizes the opportunity for stalls even if construction is heavyweight (in exchange for possibly reconstructing an object more than once and throwing away all but one copy if many threads try to construct it for the first time at once), etc.

Of course, if construction is cheap and instances are small, then just write a __eq__ (and possibly __hash__ if it's logically immutable) and be done with it.

like image 155
ShadowRanger Avatar answered Oct 30 '22 13:10

ShadowRanger