 

Efficient keys for dictionaries

I'm new to posting here, so I hope I provide enough detail. I'm trying to find out whether key selection affects the efficiency of dictionaries in Python. Some comparisons I'm thinking of are:

  • numbers vs strings (e.g. would my_dict[20] be faster than my_dict['twenty'])
  • length of strings (e.g. would my_dict['a'] be faster than my_dict['abcdefg'])
  • mixing key types within a dictionary, for instance using numbers, strings, and/or tuples (e.g. would my_dict = {0: 'zero', 2: 'two'} perform faster than {0: 'zero', 'two': 2})

I haven't been able to find this topic from a google search, so I thought maybe someone here might know.

ThatNewGuy asked Apr 29 '26 06:04

1 Answer

First of all, I'd recommend that you read How are Python's Built-In Dictionaries Implemented.
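The short version, sketched below (this is an illustration, not CPython's actual implementation), is that every lookup starts by hashing the key: hashing is O(1) for ints, and effectively O(1) for strings too, because CPython caches a string's hash after the first computation. That is why the key type barely matters.

```python
key_int = 20
key_str = 'twenty'

# Both kinds of keys reduce to an integer hash before the table probe.
print(hash(key_int))   # for small positive ints, hash(n) == n in CPython
print(hash(key_str))   # computed once, then cached on the str object
```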

Now, let's run a little random experiment to test the theory (at least partially):

import timeit  # used by the commented-out variant below
import string
import random
import time


def random_str(N):
    """Return a random string of N uppercase letters and digits."""
    return ''.join(
        random.choice(string.ascii_uppercase + string.digits) for _ in range(N)
    )


def experiment1(dct, keys):
    """Time one full pass of lookups over every key in the dictionary."""
    s = time.time()
    {dct[k] for k in keys}
    return time.time() - s


if __name__ == "__main__":
    N = 10000   # number of entries per dictionary
    K = 200     # length of each random string key
    S = 50000   # number of sampling passes
    dct1 = {random_str(K): None for _ in range(N)}  # string keys
    dct2 = {i: None for i in range(N)}              # integer keys
    keys1 = list(dct1.keys())
    keys2 = list(dct2.keys())

    samples1 = []
    samples2 = []
    for _ in range(S):
        samples1.append(experiment1(dct1, keys1))
        samples2.append(experiment1(dct2, keys2))

    print(sum(samples1), sum(samples2))

    # print(
    #     timeit.timeit('{dct1[k] for k in keys1}', setup='from __main__ import dct1, keys1')
    # )

    # print(
    #     timeit.timeit('{dct2[k] for k in keys2}', setup='from __main__ import dct2, keys2')
    # )

The results I've got with different sampling sizes on my box were:

  • N=10000, K=200, S=100 => 0.08300423622131348 0.07200479507446289
  • N=10000, K=200, S=1000 => 0.885051965713501 0.7120392322540283
  • N=10000, K=200, S=10000 => 8.88549256324768 7.005417346954346
  • N=10000, K=200, S=50000 => 43.57453536987305 34.82594871520996

So as you can see, whether you look up the dictionary with big random strings or with integers, the performance remains almost the same. The only "real" difference worth considering is the memory consumption of the two dictionaries. That may be relevant when dumping/loading huge dictionaries to/from disk; in that case a more compact data structure can shave off a few seconds when caching/reading them.
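To make the memory point concrete, here is a rough, CPython-specific sketch (exact byte counts vary by version and platform, and the key construction is made up for illustration): the dict containers end up similar in size, but long string keys cost far more memory than small integer keys.

```python
import sys

N = 10000

dct_int = {i: None for i in range(N)}
dct_str = {('k' * 50) + str(i): None for i in range(N)}  # 50+ char keys


def key_footprint(d):
    # container itself + every key; the values are all the shared None
    # singleton, so they add nothing per entry
    return sys.getsizeof(d) + sum(sys.getsizeof(k) for k in d)


print(key_footprint(dct_int))
print(key_footprint(dct_str))
```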

NB: If anyone can explain why I was getting really huge times when using timeit (the commented-out parts), please let me know... even with small experiment constants I'd get really high values... that's why I left it commented out. Add a comment if you know the reason ;D
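A likely explanation: timeit.timeit defaults to number=1000000, so each of those commented-out calls would run the whole set comprehension a million times. Passing an explicit, small number (a sketch with made-up sizes below) brings the figures back in line:

```python
import timeit

# timeit.timeit executes the statement number=1_000_000 times by default,
# so timing a comprehension over thousands of keys balloons quickly.
dct = {i: None for i in range(10_000)}
keys = list(dct)

total = timeit.timeit('{dct[k] for k in keys}',
                      globals={'dct': dct, 'keys': keys},
                      number=100)
print(total)  # seconds for 100 passes, not 1,000,000
```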

BPL answered Apr 30 '26 20:04


