I have a dict with approximately 17,000 keys. I would like to select one key at a time--it doesn't matter which one, and I don't need it to happen in any particular order (random is fine). However, after I select a key, I will alter the dictionary, perhaps by adding or deleting a key, before selecting another one. Therefore, I do not have a set list of keys that I can iterate through.
Since I don't need to access them in any particular order, I could convert the dict keys into a list each time, and then pop the first element. However, since there are 17,000 keys, making a list takes approximately 0.0005 to 0.0007 seconds on each iteration, which will take too much time for what I need. Is there a shortcut I could take so that I don't have to compile an enormous list out of dict keys each time I want to select a single key?
Dictionaries are very fast because they are implemented using a technique called hashing, which lets us look up the value for a key very quickly. By contrast, a list of (key, value) tuples is slow: to find the value associated with a key, we would have to iterate over every tuple, checking its 0th element.
Use random.choice() to get a random entry. Call items() on the dictionary to return an iterable of its entries, call list(iterable) to convert this iterable to a list, and then call random.choice() on that list.
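As a minimal sketch of that approach (the helper name random_entry and the sample dictionary are made up for illustration), the snippet below picks one entry at random. Note that it still builds a full list on every call, so its cost grows with the size of the dictionary, which is exactly the overhead the question is trying to avoid.

```python
import random

def random_entry(d):
    """Return one (key, value) pair chosen uniformly at random.

    list(d.items()) copies every entry, so each call is O(len(d)).
    """
    return random.choice(list(d.items()))

# Hypothetical sample data, only for demonstration.
inventory = {"apple": 3, "pear": 5, "plum": 7}
key, value = random_entry(inventory)
print(key, value)
```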
There are multiple ways, but you'll need to make some tradeoffs. One way is to empty the dictionary out using popitem; each call is atomic and removes one item (in arbitrary order on older Pythons, last-inserted-first since Python 3.7). But it modifies the dictionary itself; whatever item was selected isn't in it anymore. The next method that comes to mind is iterating as usual, even while modifying the dictionary. A single for loop raises RuntimeError if the dictionary changes size mid-loop, so in practice you restart the iteration after each change; the order of items might also change, so you could get the same item any number of times. To track that, you could build a second set of visited keys. It's reasonably cheap to add keys to the set, cheap to check whether each item is in it, and when you've gone through the whole dictionary you can compare the set with the dictionary's keys to determine whether there are any you missed (or removed). You do end up building a key set, but only one item per iteration; in the pessimal case, the dictionary is modified in such a way that we scan through the whole set of visited items before finding the new item.
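Both options can be sketched roughly as follows. The function names drain and visit_all, the process callback, and the sample data are assumptions made for illustration; this is one possible reading of the two approaches, not a definitive implementation.

```python
def drain(d, process):
    """Consume d one item at a time; process() may add or delete keys."""
    while d:
        key, value = d.popitem()  # removes the pair from d
        process(d, key, value)

def visit_all(d, process):
    """Visit every key without removing it, tracking visited keys in a set."""
    seen = set()
    while True:
        # Start a fresh scan each time, because process() may resize d.
        # Worst case this walks past every visited key before finding a new one.
        key = next((k for k in d if k not in seen), None)
        if key is None:
            break
        seen.add(key)
        process(d, key)
    return seen

# Demonstration with hypothetical data: keys are added and removed mid-traversal.
squares = {i: i * i for i in range(5)}
order = []

def process(d, key):
    order.append(key)
    if key == 0:
        d[99] = 0       # add a new key while traversing
    if key == 1:
        d.pop(3, None)  # delete a key that has not been visited yet

seen = visit_all(squares, process)

leftovers = {"a": 1, "b": 2}
drained = []
drain(leftovers, lambda d, k, v: drained.append(k))
```

After visit_all returns, comparing seen with set(squares) shows whether any visited key has since been removed from the dictionary.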
Is there a reason this data needs to be kept in a dictionary only? For instance, if we consider a system where we're shuffling songs, we might not want to visit the whole library but only place a limit on how recently a song has been played. That could be handled more efficiently using a list of songs from which we can read a random index, a set of recently played songs to avoid duplicates, and a queue (perhaps in a list or deque) of songs that lets us update the set in order (removing the oldest entry each iteration). Bear in mind that references are reasonably cheap.
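A rough sketch of that list, set, and queue arrangement might look like this. The names next_song, recent, history, and limit are hypothetical, and the sketch assumes the library holds more songs than the limit, otherwise the retry loop would never finish.

```python
import random
from collections import deque

def next_song(songs, recent, history, limit):
    """Pick a random song that is not among the last `limit` songs played.

    Assumes len(songs) > limit so that an eligible song always exists.
    """
    while True:
        song = songs[random.randrange(len(songs))]  # O(1) random index read
        if song not in recent:                      # O(1) set lookup
            break
    recent.add(song)
    history.append(song)
    if len(history) > limit:
        # Forget the oldest play so it becomes eligible again.
        recent.discard(history.popleft())
    return song

# Hypothetical library of six songs, avoiding repeats within the last three plays.
songs = ["a", "b", "c", "d", "e", "f"]
recent, history = set(), deque()
picks = [next_song(songs, recent, history, limit=3) for _ in range(30)]
```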
Taking this one step further, we wouldn't need the set to check for duplicates if recently played songs simply aren't among our candidates. By just swapping the oldest played song with the randomly selected next song, both the played and candidate lists stay a constant size, and no lookups are needed, since each song is in only one of the lists.
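One way to sketch the swap idea is shown below. The Shuffler class and its attribute names are invented for illustration, the played list is treated as a ring buffer, and seeding it with arbitrary songs at startup is an assumption of this sketch. It also assumes 0 < limit < len(songs).

```python
import random

class Shuffler:
    """Keeps each song in exactly one of two fixed-size lists."""

    def __init__(self, songs, limit):
        songs = list(songs)
        random.shuffle(songs)
        self.played = songs[:limit]      # ring buffer of recent plays
        self.candidates = songs[limit:]  # songs currently eligible
        self.oldest = 0                  # index of the oldest entry in played

    def next_song(self):
        i = random.randrange(len(self.candidates))
        song = self.candidates[i]
        # Swap: the oldest played song becomes a candidate again,
        # and the chosen song takes its slot in the played buffer.
        self.candidates[i], self.played[self.oldest] = self.played[self.oldest], song
        self.oldest = (self.oldest + 1) % len(self.played)
        return song

# Hypothetical library of six songs with a recency limit of three.
library = ["a", "b", "c", "d", "e", "f"]
shuffler = Shuffler(library, limit=3)
plays = [shuffler.next_song() for _ in range(30)]
```

Each pick is a single index read and a swap, so no set membership test or list rebuild is involved.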
Another idea is to use collections.ChainMap to keep a consistent view into two dictionaries: one holding the items that have been visited and one holding those that have not. You could then migrate items from the latter to the former by way of popitem, giving a readable method of processing everything in the collection while keeping it dictionary-like.
def getnewitem(chainmap):
    # maps[0] holds the unvisited items, maps[1] the visited ones.
    # Raises KeyError when finished (maps[0] is empty).
    key, value = chainmap.maps[0].popitem()
    chainmap.maps[1][key] = value
    return key, value
As that means both dictionaries keep changing, it's likely not the fastest overall, but it maintains both a dictionary-like collection and the ability to process all items. It does lose the ability to directly delete items, since ChainMap cannot hide inherited mappings; you'd need to remove them from the backing dictionaries.
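For illustration, here is how the function might be driven. The names pending, visited, and view and the sample data are assumptions; getnewitem is repeated so the snippet runs on its own.

```python
from collections import ChainMap

def getnewitem(chainmap):
    # Raises KeyError once the unvisited dict (maps[0]) is empty.
    key, value = chainmap.maps[0].popitem()
    chainmap.maps[1][key] = value
    return key, value

pending = {"a": 1, "b": 2, "c": 3}  # not yet visited
visited = {}                        # already visited
view = ChainMap(pending, visited)   # one dictionary-like view over both

processed = []
while True:
    try:
        key, value = getnewitem(view)
    except KeyError:
        break  # everything has been visited
    processed.append(key)
    if key == "c":
        view["d"] = 4  # ChainMap writes go to maps[0], so "d" gets visited too

print(processed)
print(dict(view))
```

Lookups through view keep working the whole time, whichever dictionary currently holds the key. Deleting an already visited key has to go through visited directly, since del view[key] only looks in the first mapping.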