I'm working on a problem where I need to maintain the median of a data stream, but unlike typical implementations that only support insertions, I also need to support deletions of arbitrary values.
The standard two-heap approach (a max-heap for the lower half of the values and a min-heap for the upper half) works well when we’re only inserting values, but I’m not sure how to extend it efficiently when values can also be deleted from the stream (not necessarily from the tops of the heaps).
I’m looking for a data structure or strategy that can:
Insert numbers in O(log n)
Delete any number in O(log n), or O(n) in the worst case (ideally logarithmic time)
Return the median in O(1) or O(log n)
A known approach with just insertions maintains two binary heaps: a max-heap holding the lower half of the values and a min-heap holding the upper half. If the two heaps have equal size, the current median is the average of their two root values; otherwise, the median is the root value of the larger heap. If the next value to be inserted is less than the current median, it goes into the max-heap, otherwise into the min-heap. If the heap sizes then differ by more than one, the root of the larger heap is extracted and inserted into the smaller heap. This ensures that the two heap sizes differ by at most 1.
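For reference, a minimal sketch of that insertion-only version (the class and function names are illustrative, not from any particular library):

```cpp
#include <functional>
#include <iostream>
#include <queue>
#include <vector>

// Insertion-only running median with two heaps (illustrative class name).
class StreamMedian {
    std::priority_queue<int> lower;                                       // max-heap: lower half
    std::priority_queue<int, std::vector<int>, std::greater<int>> upper;  // min-heap: upper half

public:
    void insert(int x) {
        if (lower.empty() || x <= lower.top())
            lower.push(x);
        else
            upper.push(x);
        // Rebalance so the two sizes differ by at most one.
        if (lower.size() > upper.size() + 1) {
            upper.push(lower.top());
            lower.pop();
        } else if (upper.size() > lower.size() + 1) {
            lower.push(upper.top());
            upper.pop();
        }
    }

    double median() const {                                               // assumes at least one element
        if (lower.size() == upper.size())
            return (lower.top() + upper.top()) / 2.0;
        return lower.size() > upper.size() ? lower.top() : upper.top();
    }
};

int main() {
    StreamMedian m;
    for (int x : {5, 2, 8, 1, 9}) m.insert(x);
    std::cout << m.median() << '\n';                                      // 1 2 5 8 9 -> 5
}
```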
To make this also work with deletions, you can use binary heaps that support deleting an arbitrary element in O(log n).
This can, for instance, be done by maintaining a hashmap that maps each value to the index where it currently sits in the respective heap. This hashmap must be kept synchronised with every heap manipulation (sifting up/down). With that information you can locate the value to be deleted, swap it with the last entry (leaf) of the heap, remove it, and then sift the value that took its place up or down so as to restore the heap property. The heap's size is thereby reduced by one, and the two heap sizes should then be compared again (just like after an insertion) to see whether an element needs to be moved from one heap to the other.
If your input stream can contain duplicate values, the hashmap entries should be sets of indices, so that one value can map to all the positions where it occurs. When deleting, one index is taken from that set and removed; only when a set becomes empty can the corresponding hashmap entry itself be removed.
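Here is a sketch of one side of such a structure, assuming integer values: a max-heap that keeps a value-to-indices map in sync with every swap, so an arbitrary occurrence can be deleted in O(log n). Class and member names are made up; the min-heap side is symmetric with the comparisons flipped.

```cpp
#include <cstddef>
#include <set>
#include <unordered_map>
#include <utility>
#include <vector>

// Max-heap with deletion by value (illustrative name, not a library type).
class IndexedMaxHeap {
    std::vector<int> a;                                 // the heap array
    std::unordered_map<int, std::set<std::size_t>> pos; // value -> indices where it occurs

    void record(std::size_t i) { pos[a[i]].insert(i); }
    void forget(std::size_t i) {
        auto it = pos.find(a[i]);
        it->second.erase(i);
        if (it->second.empty()) pos.erase(it);          // drop the entry only when no copies remain
    }
    void swapNodes(std::size_t i, std::size_t j) {      // swap two slots and keep pos in sync
        forget(i); forget(j);
        std::swap(a[i], a[j]);
        record(i); record(j);
    }
    void siftUp(std::size_t i) {
        while (i > 0 && a[(i - 1) / 2] < a[i]) { swapNodes(i, (i - 1) / 2); i = (i - 1) / 2; }
    }
    void siftDown(std::size_t i) {
        for (;;) {
            std::size_t l = 2 * i + 1, r = l + 1, best = i;
            if (l < a.size() && a[best] < a[l]) best = l;
            if (r < a.size() && a[best] < a[r]) best = r;
            if (best == i) return;
            swapNodes(i, best);
            i = best;
        }
    }

public:
    std::size_t size() const { return a.size(); }
    int top() const { return a.front(); }

    void insert(int x) {                                // O(log n)
        a.push_back(x);
        record(a.size() - 1);
        siftUp(a.size() - 1);
    }

    bool erase(int x) {                                 // remove one occurrence of x, O(log n)
        auto it = pos.find(x);
        if (it == pos.end()) return false;
        std::size_t i = *it->second.begin();            // any index holding x will do
        std::size_t last = a.size() - 1;
        if (i != last) swapNodes(i, last);              // move the doomed value to the last leaf
        forget(last);
        a.pop_back();
        if (i != last) { siftUp(i); siftDown(i); }      // restore the heap property at position i
        return true;
    }
};
```

The median structure itself would hold one heap of each kind and rebalance after every insert or erase exactly as in the insertion-only version.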
The data structure that meets your requirements is called an order statistic tree. Typically this is just some kind of balanced binary search tree, like a red-black tree, in which you additionally keep track of the size of each node's subtree. That size information lets you find the element of any given rank in O(log n), so the median is simply a rank query.
Unfortunately, language standard libraries don't typically come with an order statistic tree implementation, so you would have to roll your own.
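(If you happen to be compiling with GCC, its non-standard policy-based data structures extension does ship an order statistic tree, which at least illustrates the interface. A minimal sketch follows; note that this particular container rejects duplicate keys, so a stream with duplicates would need, e.g., (value, sequence number) pairs.)

```cpp
#include <functional>
#include <iostream>
#include <ext/pb_ds/assoc_container.hpp>   // GCC-specific, not part of the ISO standard
#include <ext/pb_ds/tree_policy.hpp>

// Red-black tree augmented with subtree sizes: find_by_order(k) returns an iterator to the
// k-th smallest key (0-based); order_of_key(x) counts keys strictly smaller than x.
typedef __gnu_pbds::tree<int, __gnu_pbds::null_type, std::less<int>,
                         __gnu_pbds::rb_tree_tag,
                         __gnu_pbds::tree_order_statistics_node_update>
    ordered_set;

int main() {
    ordered_set s;
    for (int x : {5, 2, 8, 1, 9}) s.insert(x);

    // The median is a rank query: the element at rank n/2 for odd n,
    // the average of the elements at ranks n/2 - 1 and n/2 for even n.
    std::cout << *s.find_by_order(s.size() / 2) << '\n';   // 1 2 5 8 9 -> 5

    s.erase(2);                                            // delete an arbitrary value in O(log n)
    std::cout << *s.find_by_order(s.size() / 2) << '\n';   // 1 5 8 9 -> 8 (upper of the two middles)
}
```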
Alternatively, you can use the technique that both @trincot and @Stef have mentioned, where you keep separate data structures for the elements above and below the median, but use whatever kind of sorted collection your language provides. In C++ you would use a pair of std::map<whatever, unsigned> that maps elements to their counts.
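As a rough illustration of that idea, here is a sketch assuming integer values (the class and member names are made up): the largest key of lower and the smallest key of upper play the role of the two heap roots, and std::map lets you locate an arbitrary value for deletion in O(log n).

```cpp
#include <cstddef>
#include <iterator>
#include <map>

// Two sorted maps (illustrative class name): "lower" holds values <= median, "upper" the rest,
// each mapping a value to its multiplicity.  nLower/nUpper count elements including duplicates.
class MapMedian {
    std::map<int, unsigned> lower, upper;
    std::size_t nLower = 0, nUpper = 0;

    static void addTo(std::map<int, unsigned>& m, std::size_t& n, int x) { ++m[x]; ++n; }
    static void removeFrom(std::map<int, unsigned>& m, std::size_t& n,
                           std::map<int, unsigned>::iterator it) {
        if (--it->second == 0) m.erase(it);        // drop the key once its count reaches zero
        --n;
    }

    void rebalance() {                             // keep the two sides within one element of each other
        if (nLower > nUpper + 1) {
            auto it = std::prev(lower.end());      // largest value of the lower half moves up
            addTo(upper, nUpper, it->first);
            removeFrom(lower, nLower, it);
        } else if (nUpper > nLower + 1) {
            auto it = upper.begin();               // smallest value of the upper half moves down
            addTo(lower, nLower, it->first);
            removeFrom(upper, nUpper, it);
        }
    }

public:
    void insert(int x) {
        if (!upper.empty() && x >= upper.begin()->first)
            addTo(upper, nUpper, x);
        else
            addTo(lower, nLower, x);
        rebalance();
    }

    bool erase(int x) {                            // remove one occurrence of x, O(log n)
        if (auto it = lower.find(x); it != lower.end())
            removeFrom(lower, nLower, it);
        else if (auto it2 = upper.find(x); it2 != upper.end())
            removeFrom(upper, nUpper, it2);
        else
            return false;
        rebalance();
        return true;
    }

    double median() const {                        // assumes at least one element is present
        if (nLower == nUpper)
            return (std::prev(lower.end())->first + upper.begin()->first) / 2.0;
        return nLower > nUpper ? std::prev(lower.end())->first : upper.begin()->first;
    }
};
```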
If your language doesn't have a sorted collection, so that you really do have to build something yourself, then the 2-heaps-with-deletions approach is easiest. You can do it the way @trincot suggests.
If amortized complexity is acceptable, though, then it's easier to support deletions by adding "cancel" elements to the heaps. Instead of deleting x, you add a cancel-x element. The cancel elements sort so that they come out immediately before the element they cancel, so when you pop the heap and get a cancel-x, you just keep popping until you've consumed as many x elements as cancel-x elements. This can make the heaps grow without bound, so you rebuild a heap whenever its number of cancel elements exceeds its number of real elements. The amortized cost of the rebuilding is O(1) per insertion or deletion.
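A sketch of one such lazily-deleting min-heap, assuming integer values (the names are made up, and the rebuild strategy shown, a linear scan followed by make_heap, is just one way to keep the rebuilding cost amortized):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Min-heap with lazy "cancel" deletions (illustrative name).  An entry is a (value, kind) pair
// where kind 0 marks a cancel and kind 1 a real element, so cancel-x sorts just before x itself.
class LazyMinHeap {
    using Entry = std::pair<int, int>;             // (value, 0 = cancel marker / 1 = real)
    std::vector<Entry> h;                          // min-heap via std:: heap algorithms
    std::size_t nCancel = 0;                       // cancel markers currently in the heap
    std::greater<Entry> cmp;                       // greater-comparator => min-heap

    void popEntry() { std::pop_heap(h.begin(), h.end(), cmp); h.pop_back(); }

    // Pop matching cancel/real pairs until the top is a genuine, undeleted element.
    void purge() {
        while (!h.empty() && h.front().second == 0) {
            int x = h.front().first;
            std::size_t pending = 0;
            while (!h.empty() && h.front() == Entry{x, 0}) { popEntry(); --nCancel; ++pending; }
            while (pending > 0 && !h.empty() && h.front() == Entry{x, 1}) { popEntry(); --pending; }
        }
    }

    // Drop all markers together with the elements they cancel, then re-heapify in O(n),
    // so the physical heap stays proportional to the logical number of elements.
    void rebuild() {
        std::unordered_map<int, std::size_t> cancels;
        for (const Entry& e : h) if (e.second == 0) ++cancels[e.first];
        std::vector<Entry> kept;
        for (const Entry& e : h) {
            if (e.second == 0) continue;                                        // drop the marker
            auto it = cancels.find(e.first);
            if (it != cancels.end() && it->second > 0) { --it->second; continue; }  // cancelled
            kept.push_back(e);
        }
        h = std::move(kept);
        nCancel = 0;
        std::make_heap(h.begin(), h.end(), cmp);
    }

public:
    std::size_t size() const { return h.size() - 2 * nCancel; }   // logical element count

    void insert(int x) { h.push_back({x, 1}); std::push_heap(h.begin(), h.end(), cmp); }

    void erase(int x) {                            // assumes one occurrence of x is present
        h.push_back({x, 0});
        std::push_heap(h.begin(), h.end(), cmp);
        ++nCancel;
        if (2 * nCancel > h.size()) rebuild();     // markers outnumber real entries
    }

    int top() { purge(); return h.front().first; }
    int pop() { purge(); int x = h.front().first; popEntry(); return x; }
};
```

A deletion-capable median structure would then pair one of these with a max-ordered counterpart and rebalance on the logical sizes, directing each deletion to the heap that actually contains the value (e.g. by comparing it with the current median).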