
Rolling median algorithm in C


I have looked at R's src/library/stats/src/Trunmed.c a few times, as I too wanted something similar in a standalone C++ class / C subroutine. Note that these are actually two implementations in one; see src/library/stats/man/runmed.Rd (the source of the help file), which says

\details{
  Apart from the end values, the result \code{y = runmed(x, k)} simply has
  \code{y[j] = median(x[(j-k2):(j+k2)])} (k = 2*k2+1), computed very
  efficiently.

  The two algorithms are internally entirely different:
  \describe{
    \item{"Turlach"}{is the Härdle-Steiger
      algorithm (see Ref.) as implemented by Berwin Turlach.
      A tree algorithm is used, ensuring performance \eqn{O(n \log
        k)}{O(n * log(k))} where \code{n <- length(x)} which is
      asymptotically optimal.}
    \item{"Stuetzle"}{is the (older) Stuetzle-Friedman implementation
      which makes use of median \emph{updating} when one observation
      enters and one leaves the smoothing window.  While this performs as
      \eqn{O(n \times k)}{O(n * k)} which is slower asymptotically, it is
      considerably faster for small \eqn{k} or \eqn{n}.}
  }
}

It would be nice to see this re-used in a more standalone fashion. Are you volunteering? I can help with some of the R bits.
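(For intuition, the Stuetzle-Friedman style of updating amounts to something like the following minimal C++ sketch of mine, not the R sources themselves: keep the window sorted, and each step is one deletion plus one insertion, i.e. O(k) per point and O(n * k) overall.)

#include <algorithm>
#include <vector>

// 'win' holds the current window of k = 2*k2 + 1 values, kept sorted.
// Sliding the window deletes the leaving value, inserts the entering
// one, and reads off the middle element.
double slide_median(std::vector<double>& win, double leaving, double entering)
{
    win.erase(std::lower_bound(win.begin(), win.end(), leaving));
    win.insert(std::upper_bound(win.begin(), win.end(), entering), entering);
    return win[win.size() / 2];
}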

Edit 1: Besides the link to the older version of Trunmed.c above, here are current SVN copies of

  • Srunmed.c (for the Stuetzle version)
  • Trunmed.c (for the Turlach version)
  • runmed.R for the R function calling these

Edit 2: Ryan Tibshirani has some C and Fortran code on fast median binning which may be a suitable starting point for a windowed approach.


I couldn't find a modern implementation of a C++ data structure with order statistics, so I ended up implementing both ideas from the TopCoder link suggested by MAK (Match Editorial: scroll down to FloatingMedian).

Two multisets

The first idea partitions the data into two data structures (heaps, multisets, etc.) with O(ln N) per insert/delete, but it does not allow the quantile to be changed dynamically without a large cost. I.e. we can have a rolling median, or a rolling 75th percentile, but not both at the same time.
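As an illustration of the two-container idea, here is a minimal, hypothetical C++ sketch of mine (not the TopCoder editorial's code) using two std::multisets for the lower and upper halves of the window:

#include <iterator>
#include <set>

// Invariant: every element of 'lo' <= every element of 'hi', and
// lo.size() is either hi.size() or hi.size() + 1, so for an odd
// window size the median is the largest element of 'lo'.
struct TwoMultisetMedian {
    std::multiset<double> lo, hi;

    void rebalance() {
        if (lo.size() > hi.size() + 1) {
            hi.insert(*lo.rbegin());
            lo.erase(std::prev(lo.end()));
        } else if (hi.size() > lo.size()) {
            lo.insert(*hi.begin());
            hi.erase(hi.begin());
        }
    }
    void insert(double x) {                  // sample entering the window
        if (lo.empty() || x <= *lo.rbegin()) lo.insert(x);
        else hi.insert(x);
        rebalance();
    }
    void erase(double x) {                   // sample leaving the window
        if (!lo.empty() && x <= *lo.rbegin()) lo.erase(lo.find(x));
        else hi.erase(hi.find(x));
        rebalance();
    }
    double median() const { return *lo.rbegin(); }   // odd window size
};

Sliding the window is then erase(oldest); insert(newest); median() at O(ln N) per step, but the rank being queried (the median) is baked into the balancing, which is exactly the inflexibility mentioned above.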

Segment tree

The second idea uses a segment tree, which is O(ln N) for inserts/deletes/queries but is more flexible. Best of all, the "N" is the size of your data range. So if your rolling median has a window of a million items, but your data varies from 1..65536, then only 16 operations are required per movement of the rolling window of 1 million!

The C++ code is similar to what Denis posted above ("Here's a simple algorithm for quantized data").
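Here is a minimal, hypothetical sketch of mine of that idea (not the code Denis posted), using a Fenwick tree of counts over the value domain 0..65535; each insert/delete touches at most 16 nodes, and the k-th smallest value is found by descending the implicit tree in log2(65536) = 16 steps:

#include <vector>

// Counts of how often each value 0..N-1 is currently in the window.
struct FenwickQuantile {
    static const int LOG = 16;
    static const int N = 1 << LOG;           // value domain 0..N-1
    std::vector<int> bit;
    FenwickQuantile() : bit(N + 1, 0) {}

    void add(int value, int delta) {         // +1 when entering, -1 when leaving
        for (int i = value + 1; i <= N; i += i & -i)
            bit[i] += delta;
    }

    int kth(int k) const {                   // k-th smallest stored value, k >= 1
        int pos = 0;
        for (int pw = N; pw > 0; pw >>= 1)
            if (pos + pw <= N && bit[pos + pw] < k) {
                pos += pw;
                k -= bit[pos];
            }
        return pos;
    }
};

Sliding the window is add(entering, +1); add(leaving, -1); the median of an odd window of size w is kth(w / 2 + 1), and any other rolling quantile is just a different rank passed to kth().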

GNU Order Statistic Trees

Just before giving up, I found that libstdc++ contains order statistic trees!!!

These have two critical operations:

iter  = tree.find_by_order(k)     // iterator to the k-th smallest element (0-based)
k     = tree.order_of_key(value)  // number of elements strictly less than value

See libstdc++ manual policy_based_data_structures_test (search for "split and join").

I have wrapped the tree in a convenience header for compilers supporting C++11 alias templates:

#if !defined(GNU_ORDER_STATISTIC_SET_H)
#define GNU_ORDER_STATISTIC_SET_H
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// A red-black tree storing elements and their order
// statistics. Note that since the tree uses
// tree_order_statistics_node_update as its update policy, it
// includes the methods find_by_order and order_of_key.
template <typename T>
using t_order_statistic_set = __gnu_pbds::tree<
                                  T,
                                  __gnu_pbds::null_type,
                                  std::less<T>,
                                  __gnu_pbds::rb_tree_tag,
                                  // This policy updates nodes'  metadata for order statistics.
                                  __gnu_pbds::tree_order_statistics_node_update>;

#endif //GNU_ORDER_STATISTIC_SET_H
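One caveat, and a hypothetical usage sketch of mine (not part of the header above): with __gnu_pbds::null_type the tree behaves as a set, so duplicate samples in the window must be disambiguated, e.g. by storing (value, sequence-number) pairs:

#include <cstdio>
#include <cstdlib>
#include <utility>
// ... plus the t_order_statistic_set alias from the header above.

int main()
{
    t_order_statistic_set<std::pair<int, long>> win;  // (value, seq) pairs
    const int k = 15;                                 // odd window size
    int buf[k];                                       // raw samples, to delete the oldest
    for (long i = 0; i < 60; ++i) {
        int v = std::rand() & 127;
        if (i >= k) win.erase(std::make_pair(buf[i % k], i - k));  // oldest leaves
        buf[i % k] = v;
        win.insert(std::make_pair(v, i));                          // newest enters
        if (i >= k - 1)   // window full: median is the k/2-th smallest (0-based)
            std::printf("median = %d\n", win.find_by_order(k / 2)->first);
    }
    return 0;
}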

I've done a C implementation here. A few more details are in this question: Rolling median in C - Turlach implementation.

Sample usage:

#include <stdio.h>
#include <stdlib.h>
#include "Mediator.h"   /* hypothetical header name for the linked C implementation */

int main(int argc, char* argv[])
{
   int i, v;
   Mediator* m = MediatorNew(15);   /* sliding window of 15 items */

   for (i = 0; i < 30; i++) {
      v = rand() & 127;
      printf("Inserting %3d \n", v);
      MediatorInsert(m, v);
      v = MediatorMedian(m);
      printf("Median = %3d.\n\n", v);
      ShowTree(m);
   }
   return 0;
}

I use this incremental median estimator:

median += eta * sgn(sample - median)

which has the same form as the more common mean estimator:

mean += eta * (sample - mean)

Here eta is a small learning rate parameter (e.g. 0.001), and sgn() is the signum function which returns one of {-1, 0, 1}. (Use a constant eta like this if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources use something like eta = 1 / n to converge, where n is the number of samples seen so far.)
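In code, one update step is a one-liner; here is a minimal C++ sketch (the helper names are mine):

// Signum: returns -1, 0, or +1 according to the sign of x.
static double sgn(double x) { return (x > 0.0) - (x < 0.0); }

// One step of the incremental median estimator described above.
static double update_median(double median, double sample, double eta)
{
    return median + eta * sgn(sample - median);
}

// Usage: double m = 0.0;  then for each sample x:  m = update_median(m, x, 0.001);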

Also, I modified the median estimator to make it work for arbitrary quantiles. In general, a quantile function tells you the value that divides the data into two fractions: p and 1 - p. The following estimates this value incrementally:

quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)

The value p should be within [0, 1]. This essentially shifts the sgn() function's symmetrical output {-1, 0, 1} to lean toward one side, partitioning the data samples into two unequally-sized bins (fractions p and 1 - p of the data are less than/greater than the quantile estimate, respectively). Note that for p = 0.5, this reduces to the median estimator.
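The generalized update, as a sketch reusing the hypothetical sgn() helper above:

// One step of the incremental p-quantile estimator; for p = 0.5 the
// 2.0 * p - 1.0 term vanishes and this reduces to update_median().
static double update_quantile(double quantile, double sample, double eta, double p)
{
    return quantile + eta * (sgn(sample - quantile) + 2.0 * p - 1.0);
}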