Different access time to a value of a dictionary when mixing int and str keys

Tags: python, dictionary, python-3.x

Let's say I have two dictionaries and I now want to measure the time needed to check whether a key is in each dictionary. I tried to run this piece of code:

from timeit import timeit

dct1 = {str(i): 1 for i in range(10**7)}
dct2 = {i: 1 for i in range(10**7)}

print(timeit('"7" in dct1', setup='from __main__ import dct1', number=10**8))
print(timeit('7 in dct2', setup='from __main__ import dct2', number=10**8))

Here are the results that I get:

2.529034548999334
2.212983401999736

Now, let's say I try to mix integers and strings in both dictionaries, and measure access time again:

dct1[7] = 1
dct2["7"] = 1

print(timeit('"7" in dct1', setup='from __main__ import dct1', number=10**8))
print(timeit('7 in dct1', setup='from __main__ import dct1', number=10**8))
print(timeit('7 in dct2', setup='from __main__ import dct2', number=10**8))
print(timeit('"7" in dct2', setup='from __main__ import dct2', number=10**8))

I get something weird:

3.443614432000686
2.6335261530002754
2.1873921409987815
2.272667104998618

The first value is much higher than what I had before (3.44 vs 2.53). However, the third value is basically the same as before (2.19 vs 2.21). Why is this happening? Can you reproduce the same thing, or is it just me? Also, I can't understand the big difference between the first and the second value: in dct1, looking up a string key now looks noticeably slower than looking up an int key, but in dct2 the same difference is only slight. Why?

Update

You don't even need to actually add a new key. All you need to do to see the increase in lookup time is check whether a key of a different type exists! This is much weirder than I thought. Look at the example here:

from timeit import timeit

dct1 = {str(i): 1 for i in range(10**7)}
dct2 = {i: 1 for i in range(10**7)}

print(timeit('"7" in dct1', setup='from __main__ import dct1', number=10**8))
# 2.55
print(timeit('7 in dct2', setup='from __main__ import dct2', number=10**8))
# 2.26

7 in dct1
"7" in dct2

print(timeit('"7" in dct1', setup='from __main__ import dct1', number=10**8))
# 3.34
print(timeit('7 in dct2', setup='from __main__ import dct2', number=10**8))
# 2.35


asked Sep 21 '21 09:09

Riccardo Bucco

1 Answer

Let me try to answer my own question. The dict implementation in CPython is optimised for lookups of str keys. Indeed, there are two different functions that are used to perform lookups:

lookdict is a generic dictionary lookup function that is used with all types of keys
lookdict_unicode is a specialised lookup function used for dictionaries composed of str-only keys

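A dict uses the string-optimised version until a lookup is made with a non-str key (or a non-str key is inserted), after which that dict switches to the more general function. That is why dct1 slows down after a single 7 in dct1 check, while dct2, which has held int keys from the start, was already using the generic function and is unaffected.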
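To make the switching behaviour concrete, here is a toy model in pure Python. It is only an illustration of the idea, not CPython's actual code: the class ToyDict, its method names, and the strategy property are all hypothetical, and the real logic lives in C.

```python
class ToyDict:
    """Toy model of a dict that swaps its lookup strategy (not CPython's code)."""

    def __init__(self, items=()):
        self._data = {}
        # Start on the str-only "fast path".
        self._lookup = self._lookup_str_only
        for key, value in items:
            self[key] = value

    def _lookup_str_only(self, key):
        if not isinstance(key, str):
            # A non-str key was seen: switch permanently, then retry.
            self._lookup = self._lookup_generic
            return self._lookup_generic(key)
        return key in self._data

    def _lookup_generic(self, key):
        # Handles any key type; nothing ever switches back from here.
        return key in self._data

    def __setitem__(self, key, value):
        if not isinstance(key, str):
            self._lookup = self._lookup_generic
        self._data[key] = value

    def __contains__(self, key):
        return self._lookup(key)

    @property
    def strategy(self):
        """Name of the lookup function currently in use."""
        return self._lookup.__name__


d = ToyDict((str(i), 1) for i in range(5))
print(d.strategy)  # _lookup_str_only
print(7 in d)      # False
print(d.strategy)  # _lookup_generic
print("3" in d)    # True
print(d.strategy)  # _lookup_generic
```

Note that the key 7 is never stored: the failed membership test alone is enough to change the strategy, which mirrors what the update in the question observed.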

And it looks like you cannot even reverse the behaviour of a particular dict instance: once it starts using the generic function, you can't go back to using the specialised one!
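If you want to check this on your own interpreter, a smaller version of the experiment could look like the sketch below. The helper names time_lookup and before_and_after are made up for this example, and the sizes are reduced so it runs quickly. The result depends on the CPython version: the explanation above applies to the interpreter used for the question, and versions with reorganised dict internals may show little or no gap.

```python
from timeit import timeit


def time_lookup(dct, key, number=10**6):
    """Return the seconds taken by `number` membership tests of `key` in `dct`."""
    return timeit("key in dct", globals={"key": key, "dct": dct}, number=number)


def before_and_after(n=10**5, number=10**6):
    """Time a str lookup before and after one lookup with an int key."""
    str_dict = {str(i): 1 for i in range(n)}
    before = time_lookup(str_dict, "7", number)
    7 in str_dict  # a single lookup with a non-str key, result discarded
    after = time_lookup(str_dict, "7", number)
    return before, after


if __name__ == "__main__":
    before, after = before_and_after()
    print(f"before: {before:.3f}s  after: {after:.3f}s")
```

Passing the dict through the globals argument of timeit avoids the from __main__ import setup string, at the cost of a global-name lookup for the key on each iteration, so the absolute numbers are not comparable with the ones in the question.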


answered Oct 14 '22 00:10

Riccardo Bucco
