I have a tree data structure, where each node can have multiple children. So there is not only a left and a right, but either less or even way more. Now I want to pick a node from this tree randomly. For every node, I know how many children are connected to it. But how can I pick them in a random fashion, uniformly would be awesome. Any ideas? I found solutions for the case of only a left and a right child, but as I said, that is not really applicable here.
Here is an observation that might be useful: suppose that you number all of the nodes in the tree in some fashion that makes it possible to efficiently look up the nth tree node for some arbitrary n. If you can do this, then you can efficiently pick a random node by choosing a random node number, then going to that node.
One very simple way to do this would be to perform a DFS or other traversal of the tree and store all the nodes in a dynamic array. You can then do O(1)-time random sampling by just indexing into the array. However, this has O(n) memory overhead and is not good if the tree is constantly changing.
If the tree is rapidly changing, another way to number the nodes that decreases the time required to recalculate indices is the following. Start off by numbering the root node 0. Then, recursively number the nodes in the first subtree, then he second, etc. Rather than storing this numbering explicitly, you can store it implicitly by having each tree node store the total number of nodes in the subtree rooted at that node. That way, to look up the nth node in the tree, you can do the following:
This approach runs very quickly if you have a relatively balanced tree with a reasonable branching factor, since you can rapidly discard parts of the tree that don't contain the nth element.
Using this approach, you get a very fast method (though not O(1)) for picking the nth element from the tree: choose a random index, then return the node at that index. Additionally, this works even if nodes are added or removed from the tree. Whenever a node is inserted, simply increment the count of all nodes on the path from the root to that node. Whenever a node is deleted, decrement the count of all nodes on the path from the root to the deleted node.
This approach still uses O(n) overhead to store the counts, though. For an O(1)-overhead algorithm that runs in linear time, look at @NPE's excellent solution based on reservoir sampling.
Hope this helps!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With