
Why do dicts of defaultdict(int)'s use so much memory? (and other simple python performance questions)

I do understand that querying a non-existent key in a defaultdict the way I do will add items to the defaultdict. That is why it is fair to compare my 2nd code snippet to my first one in terms of performance.

import numpy as num
from collections import defaultdict

topKeys = range(16384)
keys = range(8192)

table = dict((k,defaultdict(int)) for k in topKeys)

dat = num.zeros((16384,8192), dtype="int32")

print "looping begins"
#how much memory should this use? I think it shouldn't use more that a few
#times the memory required to hold (16384*8192) int32's (512 mb), but
#it uses 11 GB!
for k in topKeys:
    for j in keys:
        dat[k,j] = table[k][j]

print "done"

What is going on here? Furthermore, this similar script takes eons to run compared to the first one, and also uses an absurd quantity of memory.

topKeys = range(16384)
keys = range(8192)
table = [(j,0) for k in topKeys for j in keys]

I guess Python ints might be 64-bit ints, which would account for some of this, but do these relatively natural and simple constructions really produce such a massive overhead? I guess these scripts show that they do, so my question is: what exactly is causing the high memory usage of the first script, what is causing the long runtime and high memory usage of the second script, and is there any way to avoid these costs?
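
For a rough sense of the second snippet's footprint, its per-element cost can be estimated without building the full list; sys.getsizeof (Python 2.6+) measures one tuple, and the 8-byte list slot is an assumed figure for a 64-bit build:

import sys

n = 16384 * 8192                    # number of (j, 0) tuples the comprehension builds
tuple_bytes = sys.getsizeof((1, 0)) # one 2-tuple object on this build
slot_bytes = 8                      # one list pointer per element, assuming a 64-bit build

# Rough lower bound: one tuple plus one list slot per element. The ints inside
# the tuples are references to objects already in keys, so they add little.
print "elements: %d" % n
print "per element: %d bytes" % (tuple_bytes + slot_bytes)
print "rough total: %.1f GB" % (n * (tuple_bytes + slot_bytes) / 1024.0 ** 3)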

Edit: Python 2.6.4 on 64 bit machine.

Edit 2: I can see why, to a first approximation, my table should take up 3 GB (16384*8192*(12+12) bytes), and 6 GB with a defaultdict load factor that forces it to reserve double the space. Then inefficiencies in memory allocation eat up another factor of 2.
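
Spelling that estimate out (same assumptions as in the edit: 24 bytes per entry on a 64-bit build, and a load factor that roughly doubles the reserved space):

entries = 16384 * 8192
base = entries * (12 + 12)                                    # bytes per stored entry, as assumed above
print "base estimate: %.1f GB" % (base / 1024.0 ** 3)         # about 3 GB
print "with load factor: %.1f GB" % (2 * base / 1024.0 ** 3)  # about 6 GB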

So here are my remaining questions: Is there a way for me to tell it to use 32 bit ints somehow?

And why does my second code snippet take FOREVER to run compared to the first one? The first one takes about a minute and I killed the second one after 80 minutes.

asked Apr 30 '10 by user19511


3 Answers

Python ints are internally represented as C longs (it's actually a bit more complicated than that), but that's not really the root of your problem.

The biggest overhead is your usage of dicts. (defaultdicts and dicts are about the same in this description). dicts are implemented using hash tables, which is nice because it gives quick lookup of pretty general keys. (It's not so necessary when you only need to look up sequential numerical keys, since they can be laid out in an easy way to get to them.)

A dict can have many more slots than it has items. Let's say you have a dict with 3x as many slots as items. Each of these slots needs room for a pointer to a key and a pointer serving as the end of a linked list. That's 6x as many pointers as numbers you're storing, plus all the pointers to the items you're interested in. Consider that each of these pointers is 8 bytes on your system and that you have 16384 defaultdicts in this situation. As a rough, handwavey look at this, 16384 defaultdicts * (8192 items/defaultdict) * 7 (pointers/item) * 8 (bytes/pointer) = 7 GB. This is before I've gotten to the actual numbers you're storing (each unique number of which is itself a Python object), the outer dict, the numpy array, or the bookkeeping Python does to try to optimize things.
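
For anyone who wants to sanity-check that rough figure on their own build: sys.getsizeof (Python 2.6+) reports the size of one inner defaultdict's hash table; it does not follow references, so the key objects themselves are not counted. A minimal sketch:

import sys
from collections import defaultdict

d = defaultdict(int)
for j in range(8192):
    d[j]                            # same missing-key access pattern as the loop in the question

table_bytes = sys.getsizeof(d)      # the hash table block itself, not the objects it points at
print "one inner defaultdict: %d bytes" % table_bytes
print "16384 of them: roughly %.1f GB" % (16384 * table_bytes / 1024.0 ** 3)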

Your overhead sounds a little higher than I would expect, and I would be interested in knowing whether that 11 GB was for the whole process or whether you calculated it for just table. In any event, I do expect the size of this dict-of-defaultdicts data structure to be orders of magnitude bigger than the numpy array representation.

As to "is there any way to avoid these costs?" the answer is "use numpy for storing large, fixed-size contiguous numerical arrays, not dicts!" You'll have to be more specific and concrete about why you found such a structure necessary for better advice about what the best solution is.

answered Nov 09 '22 by Mike Graham


Well, look at what your code is actually doing:

topKeys = range(16384)
table = dict((k,defaultdict(int)) for k in topKeys)

This creates a dict holding 16384 defaultdict(int)'s. A dict has a certain amount of overhead: the dict object itself is between 60 and 120 bytes (depending on the size of pointers and ssize_t's in your build.) That's just the object itself; unless the dict has fewer than a couple of items, the data is a separate block of memory, between 12 and 24 bytes per entry, and it's always between 1/2 and 2/3rds filled. And defaultdicts are 4 to 8 bytes bigger because they have this extra thing to store. And ints are 12 bytes each, and although they're reused where possible, that snippet won't reuse most of them.

So, realistically, in a 32-bit build, that snippet will take up 60 + (16384*12) * 1.8 (fill factor) bytes for the table dict, 16384 * 64 bytes for the defaultdicts it stores as values, and 16384 * 12 bytes for the integers. That's just over a megabyte and a half without storing anything in your defaultdicts. And that's in a 32-bit build; a 64-bit build would be twice that size.
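
Those per-object numbers can be checked on your own build with sys.getsizeof (available since Python 2.6); the exact figures vary between 32-bit and 64-bit builds and between Python versions:

import sys
from collections import defaultdict

print "empty dict:        %d bytes" % sys.getsizeof({})
print "empty defaultdict: %d bytes" % sys.getsizeof(defaultdict(int))
print "one int:           %d bytes" % sys.getsizeof(12345)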

Then you create a numpy array, which is actually pretty conservative with memory:

dat = num.zeros((16384,8192), dtype="int32")

This will have some overhead for the array itself (the usual Python object overhead, plus the dimensions and type of the array and such), but it wouldn't be much more than 100 bytes, and only for the one array. It does store 16384*8192 int32's, though, and that is your 512 MB.
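
That 512 MB figure is easy to confirm from the array itself:

import numpy as num

dat = num.zeros((16384, 8192), dtype="int32")
print dat.nbytes                               # 536870912 bytes of raw element data
print "%.0f MB" % (dat.nbytes / 1024.0 ** 2)   # 512 MB; the object overhead on top is tiny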

And then you have this rather peculiar way of filling this numpy array:

for k in topKeys:
    for j in keys:
        dat[k,j] = table[k][j]

The two loops themselves don't use much memory, and they re-use it each iteration. However, table[k][j] creates a new Python integer for each value you request and stores it in the defaultdict. The integer created is always 0, and it so happens that that always gets reused, but storing the reference to it still uses up space in the defaultdict: the aforementioned 12 bytes per entry, times the fill factor (between 1.66 and 2). That lands you close to 3 GB of actual data right there, and 6 GB in a 64-bit build.

On top of that, the defaultdicts have to keep growing because you keep adding data, which means they have to keep reallocating. Because of Python's malloc frontend (obmalloc), which allocates smaller objects in blocks of its own, and because of how process memory works on most operating systems, your process will allocate more memory than it can give back; it won't actually use all of the 11 GB, and Python will re-use the available memory in between the large blocks for the defaultdicts, but the total mapped address space will be that 11 GB.
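
As a tiny illustration of the reuse point (this relies on CPython caching small ints, which is an implementation detail):

from collections import defaultdict

d = defaultdict(int)
d[1]                   # missing-key lookups insert entries with value 0
d[2]
print d[1] is d[2]     # True on CPython: both entries point at the single cached 0
print len(d)           # 2 -- the entries themselves are what cost the memory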

answered Nov 09 '22 by Thomas Wouters


Mike Graham gives a good explanation of why dictionaries use more memory, but I thought that I'd explain why your table dict of defaultdicts starts to take up so much memory.

The way the defaultdict (DD) is set up right now, whenever you retrieve an element that isn't in the DD, you get the default value for the DD (0 in your case), but the DD also stores that key, which previously wasn't in the DD, with the default value of 0. I personally don't like this, but that's how it goes. However, it means that new memory is being allocated on every iteration of the inner loop, which is why it is taking forever. If you change the lines

for k in topKeys:
    for j in keys:
        dat[k,j] = table[k][j]

to

for k in topKeys:
    for j in keys:
        if j in table[k]:
            dat[k,j] = table[k][j]
        else:
            dat[k,j] = 0

then default values aren't being assigned to keys in the DDs, so the memory stays around 540 MB for me, which is mostly just the memory allocated for dat. DDs are decent for sparse matrices, though you probably should just use the sparse matrices in SciPy if that's what you want.
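
An equivalent, slightly more compact variant is dict.get with a default value, which also reads without inserting anything:

for k in topKeys:
    for j in keys:
        # .get() looks the key up without inserting it, so nothing new
        # is stored in the defaultdict for missing keys.
        dat[k, j] = table[k].get(j, 0)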

answered Nov 09 '22 by Justin Peel