Memory efficiency: One large dictionary or a dictionary of smaller dictionaries?

Tags:

I'm writing an application in Python (2.6) that requires me to use a dictionary as a data store.

I am curious as to whether or not it is more memory efficient to have one large dictionary, or to break that down into many (much) smaller dictionaries, then have an "index" dictionary that contains a reference to all the smaller dictionaries.

I know there is a lot of overhead in general with lists and dictionaries. I read somewhere that python internally allocates enough space that the dictionary/list # of items to the power of 2.

I'm new enough to python that I'm not sure if there are other unexpected internal complexities/suprises like that, that is not apparent to the average user that I should take into consideration.

One of the difficulties is knowing how the power of 2 system counts "items"? Is each key:pair counted as 1 item? That's seems important to know because if you have a 100 item monolithic dictionary then space 100^2 items would be allocated. If you have 100 single item dictionaries (1 key:pair) then each dictionary would only be allocation 1^2 (aka no extra allocation)?

Any clearly laid out information would be very helpful!

844

asked Mar 22 '09 18:03

Brandon K

2 Answers

If you're using Python, you really shouldn't be worrying about this sort of thing in the first place. Just build your data structure the way it best suits your needs, not the computer's.

This smacks of premature optimization, not performance improvement. Profile your code if something is actually bottlenecking, but until then, just let Python do what it does and focus on the actual programming task, and not the underlying mechanics.

answered Oct 12 '22 19:10

Soviut

Three suggestions:

Use one dictionary.
It's easier, it's more straightforward, and someone else has already optimized this problem for you. Until you've actually measured your code and traced a performance problem to this part of it, you have no reason not to do the simple, straightforward thing.
Optimize later.
If you are really worried about performance, then abstract the problem make a class to wrap whatever lookup mechanism you end up using and write your code to use this class. You can change the implementation later if you find you need some other data structure for greater performance.
Read up on hash tables.
Dictionaries are hash tables, and if you are worried about their time or space overhead, you should read up on how they're implemented. This is basic computer science. The short of it is that hash tables are:
- average case O(1) lookup time
- O(n) space (Expect about 2n, depending on various parameters)
I do not know where you read that they were O(n^2) space, but if they were, then they would not be in widespread, practical use as they are in most languages today. There are two advantages to these nice properties of hash tables:
1. O(1) lookup time implies that you will not pay a cost in lookup time for having a larger dictionary, as lookup time doesn't depend on size.
2. O(n) space implies that you don't gain much of anything from breaking your dictionary up into smaller pieces. Space scales linearly with number of elements, so lots of small dictionaries will not take up significantly less space than one large one or vice versa. This would not be true if they were O(n^2) space, but lucky for you, they're not.
Here are some more resources that might help:
- The Wikipedia article on Hash Tables gives a great listing of the various lookup and allocation schemes used in hashtables.
- The GNU Scheme documentation has a nice discussion of how much space you can expect hashtables to take up, including a formal discussion of why "the amount of space used by the hash table is proportional to the number of associations in the table". This might interest you.
Here are some things you might consider if you find you actually need to optimize your dictionary implementation:
- Here is the C source code for Python's dictionaries, in case you want ALL the details. There's copious documentation in here:
  - dictobject.h
  - dictobject.c
- Here is a python implementation of that, in case you don't like reading C.
  (Thanks to Ben Peterson)
- The Java Hashtable class docs talk a bit about how load factors work, and how they affect the space your hash takes up. Note there's a tradeoff between your load factor and how frequently you need to rehash. Rehashes can be costly.

130

answered Oct 12 '22 20:10

Todd Gamblin

Related questions
                            
                                Protected method in python [duplicate]
                            
                                PySerial non-blocking read loop
                            
                                Renaming multiple files in a directory using Python
                            
                                Change Django ModelChoiceField to show users' full names rather than usernames
                            
                                How can I open a website with urllib via proxy in Python?
                            
                                IndentationError: unexpected unindent WHY?
                            
                                Exiting Python Debugger ipdb
                            
                                PrettyPrint python into a string, and not stdout
                            
                                How to make a histogram from a list of strings in Python?
                            
                                Parse http GET and POST parameters from BaseHTTPHandler?
                            
                                graphing an equation with matplotlib
                            
                                I can't seem to get --py-files on Spark to work
                            
                                PyCharm doesn't recognise installed module
                            
                                printing bold, colored, etc., text in ipython qtconsole
                            
                                Setting axes.linewidth without changing the rcParams global dict
                            
                                gcc: error trying to exec 'cc1plus': execvp: No such file or directory
                            
                                Python "best formatting practice" for lists, dictionary, etc
                            
                                Slicing a list into a list of sub-lists [duplicate]
                            
                                NLTK WordNet Lemmatizer: Shouldn't it lemmatize all inflections of a word?
                            
                                How to make Django's devserver public ? Is it generally possible?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Memory efficiency: One large dictionary or a dictionary of smaller dictionaries?

Tags:

performance

python

dictionary

memory

Brandon K

People also ask

2 Answers

Soviut

Todd Gamblin

Recent Activity

Donate For Us