 

NumPy: calculate cumulative median

I have a sample of size n.

For each i with 1 <= i <= n, I want to calculate the median of sample[:i] in NumPy. For example, this is how I computed the mean for each i:

cummean = np.cumsum(sample) / np.arange(1, n + 1)

Can I do something similar for the median, without loops or comprehensions?

Artem Kupriyanov asked Mar 13 '17 14:03





1 Answer

Knowing that Python has a heapq module that lets you keep a running 'minimum' for an iterable, I searched for heapq and median and found various posts on streaming medians. This one:

http://www.ardendertat.com/2011/11/03/programming-interview-questions-13-median-of-integer-stream/

has a class streamMedian that maintains two heaps (via heapq), one holding the bottom half of the values, the other the top half. The median is either the 'top' of one heap or the mean of the top values of both. The class has an insert method and a getMedian method. Most of the work is in insert.

I copied that into an IPython session, and defined:
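For readers who cannot open the link, here is a minimal sketch of the two-heap idea. It is not the code from the linked post: the class and method names follow the description above, but the internals (the attribute names low and high, storing the lower half as negated values, the rebalancing rule) are my own assumptions. It uses only the standard library and assumes signed numeric input, since the values are negated.

```python
import heapq
import statistics


class streamMedian:
    """Running median via two heaps (illustrative sketch, internals assumed)."""

    def __init__(self):
        self.low = []   # smaller half, kept as a max-heap by storing negated values
        self.high = []  # larger half, an ordinary min-heap

    def insert(self, x):
        # Push onto the lower half, then move its largest element up, so that
        # every value in `low` is <= every value in `high`.
        heapq.heappush(self.low, -x)
        heapq.heappush(self.high, -heapq.heappop(self.low))
        # Rebalance: `low` holds as many items as `high`, or exactly one more.
        if len(self.high) > len(self.low):
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def getMedian(self):
        if not self.low:
            raise ValueError("no values inserted yet")
        if len(self.low) > len(self.high):
            # Odd count: the median is the largest value of the lower half.
            return float(-self.low[0])
        # Even count: average the two middle values.
        return (-self.low[0] + self.high[0]) / 2.0


# Small usage example: running medians of a short list.
sm = streamMedian()
running = []
for v in [5, 1, 4, 2, 3]:
    sm.insert(v)
    running.append(sm.getMedian())
print(running)  # [5.0, 3.0, 4.0, 3.0, 3.0]

# Cross-check against a brute-force median of each prefix.
data = [5, 1, 4, 2, 3]
assert running == [statistics.median(data[:i + 1]) for i in range(len(data))]
```

Each insert costs O(log n) heap operations, so the whole cumulative median is O(n log n), which is why this kind of approach can beat recomputing the median of every prefix.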

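def cummedian_stream(b):
    # streamMedian is the two-heap class from the linked post
    S = streamMedian()
    ret = []
    for item in b:
        S.insert(item)             # add the next value to the heaps
        ret.append(S.getMedian())  # median of everything seen so far
    return np.array(ret)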

Testing against cummedian_sorted (@Divakar's sorted solution):

In [155]: a = np.random.randint(0,100,(5000))
In [156]: amed = cummedian_stream(a)
In [157]: np.allclose(cummedian_sorted(a), amed)
Out[157]: True
In [158]: timeit cummedian_sorted(a)
1 loop, best of 3: 781 ms per loop
In [159]: timeit cummedian_stream(a)
10 loops, best of 3: 39.6 ms per loop

The heapq streaming approach is roughly 20 times faster here (39.6 ms vs. 781 ms).


The list comprehension that @Uriel gave is relatively slow. But if I use np.median in place of statistics.median, it is faster than @Divakar's sorted solution:

def fastloop(a):
    return np.array([np.median(a[:i+1]) for i in range(len(a))])

In [161]: timeit fastloop(a)
1 loop, best of 3: 360 ms per loop

@Paul Panzer's partition approach is also good, but still about 10 times slower than the streaming class:

In [165]: timeit cummedian_partition(a)
1 loop, best of 3: 391 ms per loop

(I could copy the streamMedian class to this answer if needed).

hpaulj answered Sep 25 '22 13:09
