I am filling a Python dict with around 10,000,000 items. My understanding of dicts (or hash tables) is that when too many elements get into them, they need to resize, an operation that costs quite some time.
Is there a way to tell a Python dict that you will be storing at least n items in it, so that it can allocate memory from the start? Or will this optimization do nothing for my running speed?
(And no, I have not checked whether the slowness of my small script is because of this; I actually wouldn't know how to do that. This is, however, something I would do in Java: set the initial capacity of the HashSet right.)
Another way to initialize a Python dictionary is with the built-in dict() constructor: declare a variable and assign it the result of calling dict(). You can then print the variable to confirm the dictionary was initialized.
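For example, a minimal sketch (note that neither form accepts a size hint):

# Create an empty dict with the constructor, then print it.
d = dict()
print(d)  # {}
# The literal syntax does the same thing.
d = {}
print(d)  # {}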
You can try to separate key hashing from content filling with the dict.fromkeys classmethod. It will create a dict of a known size, with all values defaulting to None or to a value of your choice. After that you can iterate over it to fill in the values.
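A minimal sketch of that two-phase approach, assuming 10,000,000 integer keys and a placeholder computation for the values:

# Phase 1: create all keys up front; every value defaults to None.
d = dict.fromkeys(range(10000000))
# Phase 2: fill in the real values. No keys are added here, so no
# further resizing occurs, only assignment to existing slots.
for key in d:
    d[key] = key * 2  # stand-in for your real value computation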
Space-time tradeoff: the fastest way to repeatedly look up data with millions of entries in Python is to use dictionaries. Because dictionaries are Python's built-in mapping type, they are highly optimized.
With CPython 2.7, using dict() to create dictionaries takes up to 6 times longer and involves more memory allocation operations than the literal syntax. Use {} to create dictionaries, especially if you are pre-populating them, unless the literal syntax does not work for your case.
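You can verify this on your own interpreter with timeit; a quick sketch (exact numbers vary by machine and Python version):

import timeit
# Compare the literal syntax against the dict() constructor call.
print(timeit.timeit("{}", number=10000000))      # literal
print(timeit.timeit("dict()", number=10000000))  # constructor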
First off, I've heard rumors that you can set the size of a dictionary at initialization, but I have never seen any documentation or PEP describing how it would be done.
With this in mind, I ran an analysis on your quantity of items, described below. While resizing the dictionary does take some time, I would recommend moving ahead without worrying about it, at least until you can test its performance.
The two rules that concern us in determining resizing are the number of elements and the resize factor. A dictionary resizes itself when it becomes 2/3 full, on the addition of the element putting it over the 2/3 mark. Below 50,000 elements it grows by a factor of 4; above that amount, by a factor of 2. Using your estimate of 10,000,000 elements (between 2^23 and 2^24), your dictionary will resize itself 15 times (7 times below 50k, 8 times above). The next resize would occur just past 11,100,000.
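You can replay those rules in a few lines of Python to see roughly where the resizes land. This is a simplification of CPython 2.x's actual bookkeeping (it ignores deleted-entry slots, for instance), so small off-by-one differences are possible:

def resize_points(n_items, capacity=8):
    """Simulate the policy above: resize at 2/3 full, growing 4x
    below 50,000 used entries and 2x above, rounding capacity up
    to a power of two as CPython does."""
    points = []
    for used in range(1, n_items + 1):
        if used * 3 >= capacity * 2:  # past the 2/3 mark
            target = (4 if used < 50000 else 2) * used
            while capacity <= target:
                capacity *= 2
            points.append(used)
    return points

points = resize_points(10000000)
print(len(points), points)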
Resizing and reinserting the current elements in the hash table does take some time, but I wonder if you'd notice it with whatever else you have going on in the code nearby. I just put together a timing suite comparing inserts at five places along each boundary from dictionary sizes of 2^3 through 2^24, and the "border" additions averaged 0.4 nanoseconds longer than the "non-border" additions. That is 0.17% longer... probably acceptable. The minimum for all operations was 0.2085 microseconds, and the maximum was 0.2412 microseconds.
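If you want to reproduce that kind of measurement yourself, here is a simplified sketch (not the original suite): time every individual insert and inspect the outliers, which should correspond to resizes. Absolute numbers will differ by machine and CPython version.

import time

def insert_times(n):
    """Record the wall-clock cost of each individual dict insert."""
    d = {}
    times = []
    for i in range(n):
        t0 = time.perf_counter()
        d[i] = None
        times.append(time.perf_counter() - t0)
    return times

times = sorted(insert_times(1000000))
print("median insert:", times[len(times) // 2])
print("slowest insert (likely a resize):", times[-1])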
Hope this is insightful, and if you do check the performance of your code, please follow up with an edit! My primary resource for dictionary internals was the splendid talk given by Brandon Rhodes at PyCon 2010: The Mighty Dictionary.
Yes, you can, and here is a solution I found in another question that is related to yours:
import itertools

# Plain loop insert: 722ms
d = {}
for i in xrange(4000000):
    d[i] = None

# Building from an iterator of (key, value) pairs: 634ms
d = dict(itertools.izip(xrange(4000000), itertools.repeat(None)))

# dict.fromkeys with an iterable of keys: 558ms
d = dict.fromkeys(xrange(4000000))

# dict.fromkeys with a pre-built set: 353ms, not including set construction
s = set(xrange(4000000))
d = dict.fromkeys(s)
Those are different ways to initialize a dictionary of a given size (the timings are for CPython 2). The set-based dict.fromkeys variant is likely the fastest because CPython's fromkeys can presize the new dictionary when it is handed a set or another dict, since the number of keys is then known up front.
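On Python 3, where xrange and itertools.izip no longer exist, the equivalent comparison looks like this. A hedged sketch for re-running the benchmark yourself; the timings will differ from the numbers quoted above:

import itertools
import timeit

n = 4000000

def loop_fill():
    # Plain loop insert.
    d = {}
    for i in range(n):
        d[i] = None

print(timeit.timeit(loop_fill, number=1))
print(timeit.timeit(lambda: dict(zip(range(n), itertools.repeat(None))), number=1))
print(timeit.timeit(lambda: dict.fromkeys(range(n)), number=1))
s = set(range(n))  # build the set outside the timed region
print(timeit.timeit(lambda: dict.fromkeys(s), number=1))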