
Python: Retrieve items from a set

In general, Python sets don't seem to be designed for retrieving items by key. That's obviously what dictionaries are for. But is there any way, given a key, to retrieve the instance in a set that is equal to that key?

Again, I know this is exactly what dictionaries are for, but as far as I can see, there are legitimate reasons to want to do this with a set. Suppose you have a class defined something like:

class Person:
   def __init__(self, firstname, lastname, age):
      self.firstname = firstname
      self.lastname = lastname
      self.age = age

Now, suppose I am going to be creating a large number of Person objects, and each time I create a Person object I need to make sure it is not a duplicate of a previous Person object. A Person is considered a duplicate of another Person if they have the same firstname, regardless of other instance variables. So the obvious thing to do is insert all Person objects into a set, and define __hash__ and __eq__ methods so that Person objects are compared by their firstname.
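A minimal sketch of the set-based approach just described, assuming equality and hashing depend on firstname alone. The sample names and ages are made up for illustration.

```python
class Person:
    def __init__(self, firstname, lastname, age):
        self.firstname = firstname
        self.lastname = lastname
        self.age = age

    # Two Person objects are equal when their first names match.
    def __eq__(self, other):
        if not isinstance(other, Person):
            return NotImplemented
        return self.firstname == other.firstname

    # Equal objects must hash equally, so hash the same field.
    def __hash__(self):
        return hash(self.firstname)


people = set()
people.add(Person("Ada", "Lovelace", 36))
duplicate = Person("Ada", "Byron", 20)

is_duplicate = duplicate in people  # membership test uses __hash__ and __eq__
people.add(duplicate)               # no effect: the set keeps the element it already holds
```

The membership test detects the duplicate, but the set has no method that hands back the stored object that matched.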

An alternate option would be to create a dictionary of Person objects, and use a separately created firstname string as the key. The drawback here is that I'd be duplicating the firstname string. This isn't really a problem in most cases, but what if I have 10,000,000 Person objects? The redundant string storage could really start adding up in terms of memory usage.

But if two Person objects compare equal, I need to be able to retrieve the original object so that the additional instance variables (aside from firstname) can be merged in a way required by the business logic. Which brings me back to my problem: I need some way to retrieve instances from a set.

Is there any way to do this? Or is using a dictionary the only real option here?

Channel72 asked May 12 '11 14:05



2 Answers

I'd definitely use a dictionary here. Reusing the firstname instance variable as a dictionary key won't copy it -- the dictionary will simply use the same object. I doubt a dictionary will use significantly more memory than a set.
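A rough sketch of the dictionary approach, keyed by firstname. The `merge` function and its rule (keep the higher age) are hypothetical stand-ins for whatever the business logic requires, and `add_person` is a helper name invented for this example.

```python
class Person:
    def __init__(self, firstname, lastname, age):
        self.firstname = firstname
        self.lastname = lastname
        self.age = age


def merge(original, duplicate):
    # Hypothetical business rule: keep the higher of the two ages.
    original.age = max(original.age, duplicate.age)


people = {}


def add_person(person):
    # setdefault stores person if the key is new; either way it
    # returns the object that is stored under that firstname.
    original = people.setdefault(person.firstname, person)
    if original is not person:
        merge(original, person)
    return original


first = add_person(Person("Ada", "Lovelace", 36))
second = add_person(Person("Ada", "Byron", 40))
```

Here `second` is the same object as `first`, with the duplicate's data merged into it, and the lookup takes a single hash probe.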

To actually save memory, add a __slots__ attribute to your classes. This will prevent each of your 10,000,000 instances from having a __dict__ attribute, which will save much more memory than the potential overhead of a dict over a set.
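As an illustration of what __slots__ changes, here is a small sketch comparing two versions of the question's class. The class names `PersonWithDict` and `PersonWithSlots` are made up for this example.

```python
class PersonWithDict:
    def __init__(self, firstname, lastname, age):
        self.firstname = firstname
        self.lastname = lastname
        self.age = age


class PersonWithSlots:
    # Only these attribute names are allowed; no per-instance __dict__ is created.
    __slots__ = ("firstname", "lastname", "age")

    def __init__(self, firstname, lastname, age):
        self.firstname = firstname
        self.lastname = lastname
        self.age = age


p = PersonWithDict("Ada", "Lovelace", 36)
q = PersonWithSlots("Ada", "Lovelace", 36)

has_dict_plain = hasattr(p, "__dict__")
has_dict_slots = hasattr(q, "__dict__")
```

The trade-off is that a slotted instance rejects attributes that are not listed in __slots__, so assigning something like `q.nickname` raises AttributeError.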

Edit: Some numbers to back my claims. I defined a stupid example class storing pairs of random strings:

import random

def rand_str():
    # Random lowercase string of 3 to 15 characters.
    return "".join(chr(random.randrange(97, 123))
                   for i in range(random.randrange(3, 16)))

class A(object):
    def __init__(self):
        self.x = rand_str()
        self.y = rand_str()
    # Hash and compare on x only, so y plays no part in set membership.
    def __hash__(self):
        return hash(self.x)
    def __eq__(self, other):
        return self.x == other.x

The amount of memory used by a set of 1,000,000 instances of this class

random.seed(42)
s = set(A() for i in range(1000000))

is on my machine 240 MB. If I add

    __slots__ = ("x", "y")

to the class, this goes down to 112 MB. If I store the same data in a dictionary

def key_value():
    a = A()
    return a.x, a

random.seed(42)
d = dict(key_value() for i in range(1000000))

this uses 249 MB without __slots__ and 121 MB with __slots__.
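The answer does not say how these figures were measured. One possible way to run a comparable experiment on Python 3 is the standard-library tracemalloc module, sketched below. The class names `Plain` and `Slotted`, the `measure` helper, and the much smaller N are assumptions made for this example, and the byte counts will differ from the figures above depending on the Python version and platform.

```python
import random
import tracemalloc


def rand_str():
    return "".join(chr(random.randrange(97, 123))
                   for _ in range(random.randrange(3, 16)))


class Plain:
    def __init__(self):
        self.x = rand_str()
        self.y = rand_str()


class Slotted:
    __slots__ = ("x", "y")

    def __init__(self):
        self.x = rand_str()
        self.y = rand_str()


def measure(cls, n):
    """Return the bytes still allocated after building a dict of n instances keyed by x."""
    random.seed(42)
    tracemalloc.start()
    d = {}
    for _ in range(n):
        a = cls()
        d[a.x] = a
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current


N = 20000  # far fewer than 1,000,000, to keep the run short
plain_bytes = measure(Plain, N)
slotted_bytes = measure(Slotted, N)
print(plain_bytes, slotted_bytes)
```

tracemalloc only counts allocations made through Python's allocators while tracing is on, so it reports less than the whole-process memory an operating system tool would show.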

Sven Marnach answered Sep 28 '22 08:09



Yes, you can do this: a set can be iterated over, so you can scan it for the element that is equal to your key. But note that this is an O(n) operation, as opposed to the O(1) lookup of a dict.
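A minimal sketch of that linear scan, assuming a Person class that compares and hashes by firstname as in the question. `find_equal` is a helper name invented for this example.

```python
class Person:
    def __init__(self, firstname, lastname, age):
        self.firstname = firstname
        self.lastname = lastname
        self.age = age

    def __eq__(self, other):
        if not isinstance(other, Person):
            return NotImplemented
        return self.firstname == other.firstname

    def __hash__(self):
        return hash(self.firstname)


def find_equal(items, key):
    """Return the stored element equal to key, or None if there is none.

    Walks the whole collection in the worst case, so it is O(n).
    """
    for item in items:
        if item == key:
            return item
    return None


original = Person("Ada", "Lovelace", 36)
people = {original, Person("Grace", "Hopper", 85)}

found = find_equal(people, Person("Ada", "Byron", 20))
missing = find_equal(people, Person("Linus", "Torvalds", 41))
```

`found` is the object that was stored in the set, not the probe that was passed in, so its other attributes are available for merging.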

So you have to trade off speed against memory, which is a classic. Personally, I would optimize for speed here (i.e. use the dictionary), since memory won't run short that quickly with only 10,000,000 objects, and dictionaries are really easy to use.

As for the additional memory consumption of the firstname string: using the firstname attribute as a key will not create a new string. The dictionary just stores another reference to the same string object, which is safe because strings are immutable in Python.
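This can be checked with an identity test. The sketch below builds the name at runtime so the result does not depend on how Python interns string literals; the sample values are made up.

```python
class Person:
    def __init__(self, firstname, lastname, age):
        self.firstname = firstname
        self.lastname = lastname
        self.age = age


# Build the string at runtime rather than writing it as a literal.
firstname = "".join(["Chan", "nel"])
p = Person(firstname, "Smith", 30)

d = {p.firstname: p}

# The key held by the dictionary is the very same object as the attribute.
key = next(iter(d))
same_object = key is p.firstname
```

Since `is` compares object identity rather than value, a true result means the dictionary holds a reference to the existing string and not a second copy of its characters.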

Daren Thomas answered Sep 28 '22 08:09
