Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get a random value in a very large python dictionary

Given a python dict with multiple million entries, what is the most efficient way to get and remove a random (k,v) pair from it?

The dict is constantly growing and the random remove function is called very often.

The most cited solution for python2 random_key = random.choice(the_dict.keys()) is way too slow as a list of all the keys is created first. With lots of elements in the dict, this solution does not work.

Another proposed solution is the_dict.popitem(), but this does not return a real random object, but depends on the internal ordering of the dict.

The third solution that is also way to slow is an iterator:

 it = the_dict.iterkeys()

 for i in range (random.randint(0, len(the_dict)-1)):
     next(it)
 random_key = next(it)

Next to the remove_random(), sometimes a the_dict.pop(x) is required for a specific key. Therefore, a simple list based secondary index does not work.

Can this problem be efficiently be solved with a dict?

like image 564
meatz Avatar asked Jan 11 '23 06:01

meatz


2 Answers

A solution is to use a bidirectional map each key to an integer, to allow for randomly selecting a key by using random.randrange(0,N) to select from a range of ints which are bidirectionally mapped to the keys, where N is the number of keys.

Adding a new key simply assigns it the next higher int. Deleting a key reassigns the int for that key to the key that was assigned the previously highest int before deleting the key-value pair. Python code provided for clarity.

Python code:

def create(D): # O(len(D))
    # Create the bidirectional maps from the dictionary, D
    keys = D.keys()
    ints = range(len(keys)
    int_to_key = dict(zip(keys, ints)) 
    key_to_int = dict(zip(ints, keys))
    return (int_to_key, key_to_int)

def add(D, int_to_key, key_to_int, key, value): # O(1)
    # Add key-value pair (no extra work needed for simply changing the value)
    new_int = len(D)
    D[key] = value
    int_to_key[new_int] = key
    key_to_int[key] = new_int

def remove(D, int_to_key, key_to_int, key): # O(1)
    # Update the bidirectional maps then remove the key-value pair

    # Get the two ints and keys.
    key_int = key_to_int[key]
    swap_int = len(D) - 1 # Should be the highest int
    swap_key = int_to_key[swap_int]

    # Update the bidirectional maps so that key now has the highest int
    key_to_int[key], key_to_int[swap_key] = swap_int, key_int
    int_to_key[key_int], int_to_key[swap_int] = swap_key, key

    # Remove elements from dictionaries
    D.remove(key)
    key_to_int.remove(key)
    int_to_key.remove(key)

def random_key(D, int_to_key): # O(1)
    # Select a random key from the dictionary using the int_to_key map
    return int_to_key[random.randrange(0, len(D))]

def remove_random(D, int_to_key, key_to_int): # O(1)
    # Randomly remove a key from the dictionary via the bidirectional maps
    key = random_key(D, int_to_key)
    remove(D, int_to_key, key_to_int, key)

Note: Adding/removing keys from D without using the appropriate above functions will break the bidirectional map. This means it's best to implement this as a class.

like image 96
Nuclearman Avatar answered Jan 28 '23 12:01

Nuclearman


No, as you've discovered, this can't be done efficiently with a plain dict. See this issue for some explanations about why implementing random.choice for sets is hard; the same arguments apply to dictionaries.

But it's possible to create a dict-like data structure that does support efficient random selection. Here's a recipe for such an object, based in part on this question and its responses. It's only a starting-point, but it supports most of the existing dict methods, many of which are conveniently filled in by the MutableMapping ABC. Depending on your needs, you may need to flesh it out a bit: for example, to be able to create a RandomChoiceDict directly from a regular dict, or to add a meaningful __repr__, etc.

Essentially, you need to maintain three structures: a list of keys, a list of corresponding values, and a dict that maps keys back to indices (the inverse of the keys list). The basic __getitem__, __setitem__ and __delitem__ operations can be simply implemented in terms of those structures, and if __len__ and __iter__ are specified, the abstract base class takes care of most of the rest.

from collections import MutableMapping
import random

class RandomChoiceDict(MutableMapping):
    """
    Dictionary-like object allowing efficient random selection.

    """
    def __init__(self):
        # Add code to initialize from existing dictionaries.
        self._keys = []
        self._values = []
        self._key_to_index = {}

    def __getitem__(self, key):
        return self._values[self._key_to_index[key]]

    def __setitem__(self, key, value):
        try:
            index = self._key_to_index[key]
        except KeyError:
            # Key doesn't exist; add a new one.
            index = len(self._keys)
            self._key_to_index[key] = index
            self._keys.append(key)
            self._values.append(value)
        else:
            # Key already exists; overwrite the value.
            self._values[index] = value

    def __delitem__(self, key):
        index = self._key_to_index.pop(key)
        # Remove *last* indexed element, then put
        # it back at position 'index' (overwriting the
        # one we're actually removing) if necessary.
        key, value = self._keys.pop(), self._values.pop()
        if index != len(self._key_to_index):
            self._keys[index] = key
            self._values[index] = value
            self._key_to_index[key] = index

    def __len__(self):
        return len(self._key_to_index)

    def __iter__(self):
        return iter(self._keys)

    def random_key(self):
        """Return a randomly chosen key."""
        if not self:
            raise KeyError("Empty collection")
        index = random.randrange(len(self))
        return self._keys[index]

    def popitem_random(self):
        key = self.random_key()
        value = self.pop(key)
        return key, value

Example usage:

>>> d = RandomChoiceDict()
>>> for x in range(10**6):  # populate with some values
...     d[x] = x**2
... 
>>> d.popitem_random()  # remove and return random item
(132545, 17568177025)
>>> 132545 in d
False
>>> d.popitem_random()
(954424, 910925171776)
like image 30
Mark Dickinson Avatar answered Jan 28 '23 13:01

Mark Dickinson