Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why do namedtuples use less memory than dictionaries?

I'm asking this because I found it surprising -- I thought a namedtuple would have more overhead.

(The background is I was caching a large Django query in memory and found Django objects to be 100x the size of .values(). I then wondered what overhead namedtuple versions of the objects would be, allowing me to still use . access to the items as attributes. Smaller was not what I expected.)

#!/usr/bin/env python                                                           

from pympler.asizeof import asizeof                                             
from collections import namedtuple                                              

import random                                                                   
import string                                                                   

QTY = 100000                                                                    


class Foz(object):                                                              
    pass                                                                        

dicts = [{'foo': random.randint(0, 10000),                                      
          'bar': ''.join([random.choice(string.ascii_letters + string.digits) for n in xrange(32)]),
          'baz': random.randrange(10000),                                       
          'faz': random.choice([True, False]),                                  
          'foz': Foz()} for _ in range(QTY)]                                    

print "%d dicts: %d" % (len(dicts), asizeof(dicts))                             

# https://stackoverflow.com/questions/43921240/pythonic-way-to-convert-dictionary-to-namedtuple-or-another-hashable-dict-like

MyTuple = namedtuple('MyTuple', sorted(dicts[0]))                               

tuples = [MyTuple(**d) for d in dicts]                                          

print "%d namedtuples: %d" % (len(tuples), asizeof(tuples))                     

print "Ratio: %.01f" % (float(asizeof(tuples)) / float(asizeof(dicts))) 

Running,

$ ./foo.py    
100000 dicts: 75107672
100000 namedtuples: 56707472
Ratio: 0.8

A single tuple is even less, perhaps due to the overhead of list:

$ ./foo.py    
1 dicts: 1072
1 namedtuples: 688
Ratio: 0.6

Is it the overhead of the hashtable array? But wouldn't a namedtuple also need a hashtable of the attributes? Is pympler not being accurate?

like image 824
rrauenza Avatar asked Dec 23 '22 23:12

rrauenza


2 Answers

The basic answer is simply "yes": A normal object has an internal dictionary to store the instance's attributes:

class Foo:
    pass

f = Foo()
print(f.__dict__)
# {}

It needs to be a dict because in Python you are allowed to assign new attributes on an instance that were not defined by the class:

f.a = 1
print(f.__dict__)
# {'a': 1}

Using a dict allows fast attribute lookups, but there is memory overhead due to the data structure itself. Also, because different instances of Foo may have different attributes defined on them, every single instance might need its own dict:

g = Foo()
print(g.__dict__)
# {}
print(f.__dict_ == g.__dict__)
# False

A namedtuple does not allow adding attributes at runtime. A specific instance of namedtuple can, therefore, store all of its attributes in a single instance being shared by all instances.

Given a namedtuple and an instance:

Foo = collections.namedtuple("Foo", 'a,b')
f = Foo(1,2)

The namedtuple-constructor generates a descriptor for each field and stores it in the class; here is where the translation between named attribute and tuple index is stored. When you access attribute a on instance f, the attribute access is routed through this descriptor:

type(Foo.a)
#<class 'property'>
like image 132
user2722968 Avatar answered Jan 15 '23 13:01

user2722968


But wouldn't a namedtuple also need a hashtable of the attributes?

Nope. A namedtuple's instance layout is exactly the same as a regular tuple. The mapping from tuple entries to attributes is provided by generated descriptors, which are a mechanism Python provides for controlling attribute resolution. The descriptors are stored in the generated namedtuple type, so they're a per-type cost, not per-instance. Currently, the descriptors are property objects, as you can see in the current implementation, but that's subject to change (especially if any of this gets rewritten in C).

A namedtuple gets to be way more memory-efficient than a dict because as far as memory layout goes, it's just a tuple.

like image 20
user2357112 supports Monica Avatar answered Jan 15 '23 13:01

user2357112 supports Monica