In general, Python sets don't seem to be designed for retrieving items by key. That's obviously what dictionaries are for. But is there any way that, given a key, you can retrieve an instance from a set which is equal to the key?
Again, I know this is exactly what dictionaries are for, but as far as I can see, there are legitimate reasons to want to do this with a set. Suppose you have a class defined something like:
```python
class Person:
    def __init__(self, firstname, lastname, age):
        self.firstname = firstname
        self.lastname = lastname
        self.age = age
```
Now, suppose I am going to be creating a large number of `Person` objects, and each time I create a `Person` object I need to make sure it is not a duplicate of a previous `Person` object. A `Person` is considered a duplicate of another `Person` if they have the same `firstname`, regardless of other instance variables. So naturally the obvious thing to do is insert all `Person` objects into a set, and define `__hash__` and `__eq__` methods so that `Person` objects are compared by their `firstname`.
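The hash-by-`firstname` scheme described above might look like this (a sketch; the `isinstance` guard is my addition, and equality compares only `firstname` as stated):

```python
class Person:
    def __init__(self, firstname, lastname, age):
        self.firstname = firstname
        self.lastname = lastname
        self.age = age

    def __hash__(self):
        # Hash on firstname only, to match the equality definition below
        return hash(self.firstname)

    def __eq__(self, other):
        # Two Persons are duplicates when their firstnames match,
        # regardless of lastname or age
        return isinstance(other, Person) and self.firstname == other.firstname

people = set()
people.add(Person("Alice", "Smith", 30))
print(Person("Alice", "Jones", 25) in people)  # True: treated as a duplicate
```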
An alternate option would be to create a dictionary of `Person` objects, and use a separately created `firstname` string as the key. The drawback here is that I'd be duplicating the `firstname` string. This isn't really a problem in most cases, but what if I have 10,000,000 `Person` objects? The redundant string storage could really start adding up in terms of memory usage.

But if two `Person` objects compare equal, I need to be able to retrieve the original object so that the additional instance variables (aside from `firstname`) can be merged in a way required by the business logic. Which brings me back to my problem: I need some way to retrieve instances from a `set`.

Is there any way to do this? Or is using a dictionary the only real option here?
I'd definitely use a dictionary here. Reusing the `firstname` instance variable as a dictionary key won't copy it -- the dictionary will simply use the same object. I doubt a dictionary will use significantly more memory than a set.
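As a sketch of that approach (the `add_person` helper and its return convention are my own, not part of the answer):

```python
class Person:
    def __init__(self, firstname, lastname, age):
        self.firstname = firstname
        self.lastname = lastname
        self.age = age

people = {}

def add_person(p):
    """Insert p, or return the already-stored Person with the same firstname."""
    existing = people.setdefault(p.firstname, p)
    if existing is not p:
        # Duplicate: the original instance is available here for merging
        return existing
    return p

a = add_person(Person("Alice", "Smith", 30))
b = add_person(Person("Alice", "Jones", 25))
print(b is a)  # True: the original instance is retrieved, ready for merging
```

Note that the key stored in `people` is the very same string object as `a.firstname`, so no string data is duplicated.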
To actually save memory, add a `__slots__` attribute to your classes. This will prevent each of your 10,000,000 instances from having a `__dict__` attribute, which will save much more memory than the potential overhead of a `dict` over a `set`.
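The effect of `__slots__` is easy to verify: instances of a slotted class have no per-instance `__dict__` at all. A minimal demonstration:

```python
class NoSlots(object):
    def __init__(self):
        self.x, self.y = "abc", "def"

class WithSlots(object):
    __slots__ = ("x", "y")
    def __init__(self):
        self.x, self.y = "abc", "def"

print(hasattr(NoSlots(), "__dict__"))    # True: each instance carries a dict
print(hasattr(WithSlots(), "__dict__"))  # False: attributes live in fixed slots
```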
Edit: Some numbers to back my claims. I defined a stupid example class storing pairs of random strings:
```python
import random

def rand_str():
    return str.join("", (chr(random.randrange(97, 123))
                         for i in range(random.randrange(3, 16))))

class A(object):
    def __init__(self):
        self.x = rand_str()
        self.y = rand_str()

    def __hash__(self):
        return hash(self.x)

    def __eq__(self, other):
        return self.x == other.x
```
The amount of memory used by a set of 1,000,000 instances of this class

```python
random.seed(42)
s = set(A() for i in xrange(1000000))
```

is 240 MB on my machine. If I add

```python
__slots__ = ("x", "y")
```

to the class, this goes down to 112 MB. If I store the same data in a dictionary

```python
def key_value():
    a = A()
    return a.x, a

random.seed(42)
d = dict(key_value() for i in xrange(1000000))
```

this uses 249 MB without `__slots__` and 121 MB with `__slots__`.
Yes, you can do this: a `set` can be iterated over. But note that this is an O(n) operation, as opposed to the O(1) lookup of a dict.
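That linear scan might be written like this (a sketch; `get_equal` and the pared-down `Person` are my names, not from the question):

```python
class Person(object):
    def __init__(self, firstname, age):
        self.firstname = firstname
        self.age = age

    def __hash__(self):
        return hash(self.firstname)

    def __eq__(self, other):
        return self.firstname == other.firstname

def get_equal(s, key):
    # O(n): every element may be checked, unlike a dict's O(1) hash lookup
    for item in s:
        if item == key:
            return item
    return None

s = {Person("Alice", 30)}
found = get_equal(s, Person("Alice", 99))
print(found.age)  # 30: the originally stored instance comes back
```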
So, you have to trade off speed versus memory; this is a classic. I personally would optimize for speed here (i.e. use the dictionary), since memory won't run short so quickly with only 10,000,000 objects, and using dictionaries is really easy.
As for additional memory consumption for the `firstname` string: since strings are immutable in Python, using the `firstname` attribute as a key will not create a new string, but just copy the reference.
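This is easy to check with an identity test (a small sketch; the string is built at runtime so the result doesn't hinge on interning of a literal):

```python
class Person(object):
    def __init__(self, firstname):
        self.firstname = firstname

# Build the name at runtime rather than from a single literal
p = Person("".join(["Al", "ice"]))
d = {p.firstname: p}

key = next(iter(d))
print(key is p.firstname)  # True: the dict key is the same string object, not a copy
```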