NumPy: calculate cumulative median

1 Answer

Knowing that Python has a heapq module that lets you keep a running 'minimum' for an iterable, I searched for heapq and median, and found several posts on streaming medians. This one:

http://www.ardendertat.com/2011/11/03/programming-interview-questions-13-median-of-integer-stream/

has a class streamMedian that maintains two heapq, one with the bottom half of the values, the other with top half. The median is either the 'top' of one or the mean of values from both. The class has an insert method and a getMedian method. Most of the work is in the insert.
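For readers who cannot reach the linked post, here is a minimal sketch of the two-heap idea described above. It is not the code from that post: the class name StreamMedianSketch is hypothetical, and only the method names insert and getMedian are taken from the description. Because heapq provides min-heaps only, the sketch assumes the lower half is stored as negated values.

```python
import heapq


class StreamMedianSketch:
    """Running median with two heaps (illustrative, not the linked post's code)."""

    def __init__(self):
        self.low = []   # lower half, stored negated so heapq acts as a max-heap
        self.high = []  # upper half, a plain min-heap

    def insert(self, x):
        # Route the new value through the lower half so that the largest
        # value of the lower half ends up in the upper half.
        heapq.heappush(self.low, -x)
        heapq.heappush(self.high, -heapq.heappop(self.low))
        # Rebalance: low holds as many items as high, or exactly one more.
        if len(self.high) > len(self.low):
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def getMedian(self):
        if not self.low:
            raise ValueError("no values inserted yet")
        if len(self.low) > len(self.high):
            # Odd count: the median is the top of the lower half.
            return float(-self.low[0])
        # Even count: the median is the mean of the two heap tops.
        return (-self.low[0] + self.high[0]) / 2.0
```

Each insert costs O(log n) heap operations, so a full cumulative median over n values takes O(n log n), whereas recomputing the median of every prefix from scratch takes quadratic time overall.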

I copied that class from the linked post into an IPython session, and defined:

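import numpy as np

def cummedian_stream(b):
    # streamMedian is the two-heap class from the linked post
    S = streamMedian()
    ret = []
    for item in b:
        S.insert(item)              # add the next value to the stream
        ret.append(S.getMedian())   # median of everything seen so far
    return np.array(ret)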

Testing:

In [155]: a = np.random.randint(0,100,(5000))
In [156]: amed = cummedian_stream(a)
In [157]: np.allclose(cummedian_sorted(a), amed)
Out[157]: True
In [158]: timeit cummedian_sorted(a)
1 loop, best of 3: 781 ms per loop
In [159]: timeit cummedian_stream(a)
10 loops, best of 3: 39.6 ms per loop

The heapq stream approach is about 20 times faster on this 5000-element test (39.6 ms versus 781 ms).

The list comprehension that @Uriel gave is relatively slow. But if I substitute np.median for statistics.median it is faster than @Divakar's sorted solution:

def fastloop(a):
    return np.array([np.median(a[:i+1]) for i in range(len(a))])

In [161]: timeit fastloop(a)
1 loop, best of 3: 360 ms per loop

@Paul Panzer's partition approach is also good, but still about ten times slower than the streaming class on this input.

In [165]: timeit cummedian_partition(a)
1 loop, best of 3: 391 ms per loop

(I could copy the streamMedian class to this answer if needed).

161

answered Sep 25 '22 13:09

hpaulj


Tags:

python

vectorization

numpy

statistics

Artem Kupriyanov
