
Constant time random selection and deletion

I'm trying to implement an edge list for a MultiGraph in Python.

What I've tried so far:

>>> l1 = Counter({(1, 2): 2, (1, 3): 1})
>>> l2 = [(1, 2), (1, 2), (1, 3)]

l1 has constant-time deletion of all edges between two vertices (e.g. del l1[(1, 2)]) but linear-time random selection on those edges (e.g. random.choice(list(l1.elements()))). Note that the selection has to be over l1.elements(), not l1 itself, or the multiplicities are ignored.

l2 has constant-time random selection (random.choice(l2)) but linear-time deletion of all elements equal to a given edge ([i for i in l2 if i != (1, 2)]).
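The two trade-offs can be seen side by side in a small illustrative sketch:

```python
from collections import Counter
import random

# Counter form: O(1) deletion of all copies, but random selection
# must expand the multiset to a list first, which is O(n).
l1 = Counter({(1, 2): 2, (1, 3): 1})
picked_from_counter = random.choice(list(l1.elements()))  # linear-time selection
del l1[(1, 2)]                                            # constant-time deletion

# List form: O(1) random selection, but deleting all copies of one
# edge scans the whole list, which is O(n).
l2 = [(1, 2), (1, 2), (1, 3)]
picked_from_list = random.choice(l2)                      # constant-time selection
l2 = [e for e in l2 if e != (1, 2)]                       # linear-time deletion
```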

Question: is there a Python data structure that would give me both constant-time random selection and deletion?

Asked by MikeRand on Mar 23 '26 at 17:03


1 Answer

I don't think what you're trying to do is achievable in theory.

If you're using weighted values to represent duplicates, you can't get constant-time random selection. The best you could possibly do is some kind of skip-list-type structure that lets you binary-search the element by weighted index, which is logarithmic.
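As a rough sketch of that weighted-index search, here is logarithmic selection by binary-searching a prefix-sum array with the stdlib bisect module. The names and the build-the-prefix-array-up-front strategy are illustrative only; keeping the sums updatable in logarithmic time after deletions would need something like a Fenwick tree.

```python
import bisect
import random

# Distinct edges and how many copies of each exist.
edges = [(1, 2), (1, 3)]
counts = [2, 1]

# Running prefix sums of the counts: prefix[i] is the total number
# of copies in edges[:i+1].
prefix = []
total = 0
for c in counts:
    total += c
    prefix.append(total)

def weighted_choice():
    # Pick a uniform weighted index in [0, total), then locate the
    # edge that owns it -- O(log n) via binary search.
    i = random.randrange(total)
    return edges[bisect.bisect_right(prefix, i)]
```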

If you're not using weighted values to represent duplicates, then you need some structure that allows you to store multiple copies. And a hash table isn't going to do it: the dups have to be independent objects (e.g., (edge, autoincrement)), meaning there's no way to delete all that match some criterion in constant time.

If you can accept logarithmic time, the obvious choice is a tree. For example, using blist:

>>> import blist
>>> l3 = blist.sortedlist(l2)

To select one at random:

>>> edge = random.choice(l3)

The documentation doesn't seem to guarantee that this won't be O(n), but the source (for both 3.3 and 2.7) shows that it does the right thing. If you don't trust that, just write l3[random.randrange(len(l3))].

To delete all copies of an edge, you can do it like this:

>>> del l3[l3.bisect_left(edge):l3.bisect_right(edge)]

Or:

>>> try:
...     while True:
...         l3.remove(edge)
... except ValueError:
...     pass

The documentation explains the exact performance guarantees for every operation involved. In particular, len is constant, while indexing, slicing, deleting by index or slice, bisecting, and removing by value are all logarithmic, so both operations end up logarithmic.

(It's worth noting that blist is a B+Tree; you might get better performance out of a red-black tree, or a treap, or something else. You can find good implementations for most data structures on PyPI.)
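If blist isn't available, the same bisect-and-slice pattern can be sketched with only the stdlib bisect module on a plain sorted list. Note the caveat: on a plain list, insertion and the slice deletion itself are linear, so this only illustrates the pattern, not the logarithmic bounds a tree gives you.

```python
import bisect
import random

# A plain sorted list of edges (duplicates allowed, kept adjacent).
l3 = sorted([(1, 2), (1, 2), (1, 3)])

# Constant-time random selection.
edge = random.choice(l3)

# Find the half-open range holding every copy of the target edge
# (two binary searches), then delete it in one slice operation.
target = (1, 2)
lo = bisect.bisect_left(l3, target)
hi = bisect.bisect_right(l3, target)
del l3[lo:hi]
```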


As pointed out by senderle, if the maximum number of copies of an edge is much smaller than the size of the collection, you can create a data structure that does it in time quadratic on the maximum number of copies. Translating his suggestion into code:

import random
from collections import defaultdict

class MGraph(object):
    def __init__(self):
        self.edgelist = []
        self.edgedict = defaultdict(list)

    def add(self, edge):
        # Record the new copy's index, then append it to the list.
        self.edgedict[edge].append(len(self.edgelist))
        self.edgelist.append(edge)

    def remove(self, edge):
        # Delete each copy by overwriting it with the current last
        # element and popping the tail. Process indices from largest
        # to smallest so earlier removals can't invalidate the rest.
        for index in sorted(self.edgedict.pop(edge, ()), reverse=True):
            maxedge = len(self.edgelist) - 1
            lastedge = self.edgelist[maxedge]
            self.edgelist[index] = lastedge
            del self.edgelist[-1]
            if index != maxedge:
                # The displaced edge now lives at `index`; fix its
                # recorded position.
                self.edgedict[lastedge] = [index if i == maxedge else i
                                           for i in self.edgedict[lastedge]]

    def choice(self):
        return random.choice(self.edgelist)

(You could, of course, replace the list-comprehension update with a three-line find-and-update-in-place, but that's still linear in the number of dups.)
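Here's a quick end-to-end check of the idea. The class is repeated in condensed form so the snippet runs standalone; the remove shown processes indices largest-first, which is one way to keep the swap-with-last bookkeeping straight.

```python
import random
from collections import defaultdict

class MGraph:
    def __init__(self):
        self.edgelist = []                 # one entry per copy
        self.edgedict = defaultdict(list)  # edge -> indices of its copies

    def add(self, edge):
        self.edgedict[edge].append(len(self.edgelist))
        self.edgelist.append(edge)

    def remove(self, edge):
        # Largest index first, so popping the tail never invalidates
        # an index we still have to process.
        for index in sorted(self.edgedict.pop(edge, ()), reverse=True):
            maxedge = len(self.edgelist) - 1
            lastedge = self.edgelist[maxedge]
            self.edgelist[index] = lastedge
            del self.edgelist[-1]
            if index != maxedge:
                self.edgedict[lastedge] = [index if i == maxedge else i
                                           for i in self.edgedict[lastedge]]

    def choice(self):
        return random.choice(self.edgelist)

g = MGraph()
for e in [(1, 2), (1, 2), (1, 3)]:
    g.add(e)
g.remove((1, 2))   # drops both copies at once
```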

Obviously, if you plan to use this for real, you may want to beef up the class a bit. You can make it look like a list of edges, a set of tuples of multiple copies of each edge, a Counter, etc., by implementing a few methods and letting the appropriate collections.abc.Foo/collections.Foo fill in the rest.


So, which is better? Well, in your sample case, the average dup count is half the size of the list, and the maximum is 2/3rds the size. If that were true for your real data, the tree would be much, much better, because log N will obviously blow away (N/2)**2. On the other hand, if dups were rare, senderle's solution would obviously be better, because W**2 is still 1 if W is 1.

Of course for a 3-element sample, constant overhead and multipliers are going to dominate everything. But presumably your real collection isn't that tiny. (If it is, just use a list...)

If you don't know how to characterize your real data, write both implementations and time them with various realistic inputs.
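A minimal timing harness along those lines might look like this. The workload generator and its dup_rate knob are made up for illustration; plug the tree-based and dict-based implementations into the same loop to compare them on inputs shaped like your real data.

```python
import random
import timeit

def make_edges(n, dup_rate, seed=0):
    # Build n edges drawn from a pool whose size shrinks as dup_rate
    # grows, so a higher dup_rate means more duplicates on average.
    rng = random.Random(seed)
    distinct = max(1, int(n * (1 - dup_rate)))
    return [(0, rng.randrange(distinct)) for _ in range(n)]

def list_round(edges):
    # One "round" of the plain-list approach: pick a random edge,
    # then delete all of its copies with a linear scan.
    edge = random.choice(edges)
    return [e for e in edges if e != edge]

edges = make_edges(10_000, dup_rate=0.5)
t = timeit.timeit(lambda: list_round(edges), number=100)
```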

Answered by abarnert on Mar 25 '26 at 05:03


