
An efficient quantiles algorithm/data structure that allows samples to be updated as they increment over time?

I'm looking for an efficient quantiles algorithm that allows sample values to be "upserted" or replaced as the value changes over time.

Let's say I have values for items 1-n. I'd like to put these into a quantiles algorithm that would efficiently store them. But then say at some point in the future, the value for item-i gets incremented. I'd like to remove the original value for item-i and replace it with the updated value. The specific use case is for a streaming system where the sample values are incrementing over time.

The closest I've seen to something like this is the t-Digest data structure. It stores sample values efficiently. The only thing it lacks is the ability to remove and replace a sample value.

I've also looked at the Apache DataSketches quantiles sketch. It suffers from the same problem: there is no way to remove and replace a sample.

Edit: thinking about this more, there wouldn't necessarily need to be a removal of the old value followed by an insertion of the incremented value. There might be a way to recalculate the internal state more easily if there's a constraint that values can only be updated.

marathon asked Jun 23 '20 01:06



1 Answer

If O(log n) update time and O(log n) quantile query time are acceptable, one solution is to implement any self-balancing binary search tree (splay tree, AVL tree, red-black tree) ordered by value, and to keep a HashMap<Key, Node> alongside the tree (or, if you know that your keys are, e.g., the numbers 0 to n-1, you can just use an array for the same purpose). Each node also needs to store the number of nodes in its subtree. This is possible with all of the trees mentioned: it is a small addition to every method that modifies nodes, such as rotations.
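As an illustration of that bookkeeping, here is a minimal Python sketch of a right rotation that keeps the subtree counts valid. The names `Node`, `nodes_count`, `subtree_size`, `update_count` and `rotate_right` are assumptions made for this example and do not come from any particular library.

```python
def subtree_size(node):
    """Number of nodes in the subtree rooted at node (0 for an empty subtree)."""
    return node.nodes_count if node is not None else 0


class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right
        self.nodes_count = 1 + subtree_size(left) + subtree_size(right)


def update_count(node):
    """Recompute a node's count from its (already correct) children."""
    node.nodes_count = 1 + subtree_size(node.left) + subtree_size(node.right)


def rotate_right(node):
    """Standard right rotation; returns the new root of this subtree."""
    pivot = node.left
    node.left = pivot.right
    pivot.right = node
    update_count(node)   # node is now the lower of the two, so refresh it first
    update_count(pivot)  # then the new subtree root
    return pivot
```

Only the two nodes whose children changed need their counts recomputed, so the rotation stays O(1).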

Pseudocode for updating the item with key K to a new value V would be:

node = find_node_in_hash_map_by_key(K)                        # O(1)
delete_node_keeping_subtree_counts_valid(node)                # O(log n)
new_node = add_new_node_keeping_subtree_counts_valid(K, V)    # O(log n)
store_node_in_hash_map(K, new_node)                           # O(1), so the next update finds the new node
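The update steps can be sketched end to end in Python. Note that this sketch swaps in a treap (a randomized balanced BST with expected O(log n) operations) in place of the AVL, red-black or splay trees mentioned above, only because its split/merge form is short. It also keeps a dict of key to current value and deletes by the pair (value, key) rather than holding node references. All names (`UpdatableQuantiles`, `upsert`, `quantile`) are hypothetical, and the rank convention `int(q * n)` is one possible choice among several.

```python
import random


class Node:
    def __init__(self, item):
        self.item = item             # (value, key); the key breaks ties between equal values
        self.prio = random.random()  # random heap priority, balances the tree in expectation
        self.left = None
        self.right = None
        self.count = 1               # number of nodes in this subtree


def count(node):
    return node.count if node is not None else 0


def refresh(node):
    node.count = 1 + count(node.left) + count(node.right)
    return node


def split(node, item):
    """Split into (items < item, items >= item)."""
    if node is None:
        return None, None
    if node.item < item:
        left, right = split(node.right, item)
        node.right = left
        return refresh(node), right
    left, right = split(node.left, item)
    node.left = right
    return left, refresh(node)


def merge(a, b):
    """Merge two treaps; every item in a must be smaller than every item in b."""
    if a is None or b is None:
        return a if b is None else b
    if a.prio > b.prio:
        a.right = merge(a.right, b)
        return refresh(a)
    b.left = merge(a, b.left)
    return refresh(b)


def delete(node, item):
    """Remove the node holding item, fixing counts on the way back up."""
    if node is None:
        return None
    if item == node.item:
        return merge(node.left, node.right)
    if item < node.item:
        node.left = delete(node.left, item)
    else:
        node.right = delete(node.right, item)
    return refresh(node)


class UpdatableQuantiles:
    def __init__(self):
        self.root = None
        self.current = {}            # key -> value currently stored in the tree

    def upsert(self, key, value):
        if key in self.current:      # drop the old sample first
            self.root = delete(self.root, (self.current[key], key))
        self.current[key] = value
        left, right = split(self.root, (value, key))
        self.root = merge(merge(left, Node((value, key))), right)

    def quantile(self, q):
        """Value at rank int(q * n), clamped to n - 1; assumes 0 <= q <= 1."""
        n = count(self.root)
        if n == 0:
            raise ValueError("no samples")
        i = min(n - 1, int(q * n))
        node = self.root
        while True:
            left_count = count(node.left)
            if i < left_count:
                node = node.left
            elif i == left_count:
                return node.item[0]
            else:
                i -= left_count + 1
                node = node.right
```

Calling `upsert(key, value)` again for an existing key replaces that item's sample, so the number of stored samples stays equal to the number of distinct keys. Keys must be mutually comparable here because they are used as tie-breakers.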

Getting quantile q is also possible in O(log n), because the subtree size stored in each node lets you reach the i-th smallest element in O(log n) time. For n stored values, quantile q corresponds to a rank such as i = floor(q * n), clamped to n - 1. Pseudocode for that operation would look like:

# return the i-th smallest value (0-based)
def select(root, i):
    node = root
    while node is not None:
        left = node.left
        left_count = left.nodes_count if left is not None else 0
        if i < left_count:
            node = left  # the i-th element is in the left subtree
        elif i == left_count:
            return node.value  # exactly i elements are in the left subtree, so the i-th value is in the current node
        else:
            i -= left_count + 1  # select element i - left_count - 1 from the right subtree
            node = node.right
    raise IndexError("i is out of range")
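To see the rank selection working on concrete data, here is a small self-contained Python sketch. It builds a balanced tree from an already sorted list purely for demonstration (a real implementation would use the self-balancing inserts and deletes described above), and the mapping from q to a rank via `int(q * n)` is an assumed convention. The names `build`, `select` and `quantile` are hypothetical.

```python
def size(node):
    return node.nodes_count if node is not None else 0


class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right
        self.nodes_count = 1 + size(left) + size(right)


def build(sorted_values, lo=0, hi=None):
    """Build a balanced tree from sorted values (demonstration only)."""
    if hi is None:
        hi = len(sorted_values)
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    return Node(sorted_values[mid],
                build(sorted_values, lo, mid),
                build(sorted_values, mid + 1, hi))


def select(root, i):
    """Return the i-th smallest value (0-based) using the subtree sizes."""
    node = root
    while node is not None:
        left_count = size(node.left)
        if i < left_count:
            node = node.left
        elif i == left_count:
            return node.value
        else:
            i -= left_count + 1   # skip the left subtree and the current node
            node = node.right
    raise IndexError("rank out of range")


def quantile(root, q):
    """Value at rank int(q * n), clamped to n - 1; assumes 0 <= q <= 1."""
    n = size(root)
    if n == 0:
        raise ValueError("no samples")
    return select(root, min(n - 1, int(q * n)))


root = build([10, 20, 30, 40, 50])
median = quantile(root, 0.5)      # rank int(0.5 * 5) = 2, i.e. the value 30
```

Each loop iteration descends one level, so the cost is proportional to the tree height, which is O(log n) for a balanced tree.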

I'm not aware of a good open-source Java implementation of this data structure, but writing your own AVL tree is not that difficult. A splay tree should be the easiest to write; the worst-case cost of a single operation is not O(log n), but the amortized cost is O(log n), so in practice it should perform well.

Alexander Pivovarov answered Oct 04 '22 03:10
