What is the fastest way to get the mode of a NumPy array?

I have to find the mode of a NumPy array that I read from an HDF5 file. The array is 1-D and contains floating-point values.

import scipy.stats

my_array = f1[ds_name][()]  # Dataset.value was removed in h5py 3.0; [()] reads the whole dataset
mod_value = scipy.stats.mode(my_array)

The array holds around 1M values, and my script takes about 15 minutes to return the mode. Is there any way to make this faster?

Also, why does scipy.stats.median(my_array) fail while mode works?

AttributeError: module 'scipy.stats' has no attribute 'median'

Heli asked Nov 20 '25 09:11



1 Answer

The implementation of scipy.stats.mode has a Python loop for handling the axis argument with multidimensional arrays. The following simple implementation, for one-dimensional arrays only, is faster:

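import numpy as np

def mode1(x):
    # Sorted unique values and the number of times each one occurs.
    values, counts = np.unique(x, return_counts=True)
    # Index of the most frequent value (the smallest one if there is a tie).
    m = counts.argmax()
    return values[m], counts[m]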
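
Since the question concerns floating-point data, here is a small usage sketch of mode1 on float arrays. The sample arrays are made up for illustration, and mode1 is repeated so the snippet runs on its own. It also shows the tie behaviour: np.unique returns sorted values and argmax picks the first maximum, so the smallest tied value is returned.

```python
import numpy as np

def mode1(x):
    values, counts = np.unique(x, return_counts=True)
    m = counts.argmax()
    return values[m], counts[m]

# Made-up float data: 2.5 occurs three times, more than any other value.
sample = np.array([2.5, 1.0, 2.5, 3.0, 1.0, 2.5])
value, count = mode1(sample)
print(value, count)

# Two values tie with two occurrences each; the smaller one is returned.
tied = np.array([4.0, 1.5, 4.0, 1.5])
tied_value, tied_count = mode1(tied)
print(tied_value, tied_count)
```

Note that exact-match counting only finds a meaningful mode if the float values actually repeat bit-for-bit.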

Here's an example. First, make an array of integers with length 1000000.

In [40]: x = np.random.randint(0, 1000, size=(2, 1000000)).sum(axis=0)

In [41]: x.shape
Out[41]: (1000000,)

Check that scipy.stats.mode and mode1 give the same result.

In [42]: from scipy.stats import mode

In [43]: mode(x)
Out[43]: ModeResult(mode=array([1009]), count=array([1066]))

In [44]: mode1(x)
Out[44]: (1009, 1066)
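
An independent cross-check that does not need SciPy can be sketched with collections.Counter from the standard library. The fixed seed and the smaller array size are assumptions chosen for a quick, repeatable run, and mode1 is repeated so the snippet is self-contained.

```python
from collections import Counter

import numpy as np

def mode1(x):
    values, counts = np.unique(x, return_counts=True)
    m = counts.argmax()
    return values[m], counts[m]

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice
x = rng.integers(0, 1000, size=(2, 100_000)).sum(axis=0)

value, count = mode1(x)

# Count occurrences with plain Python and find the highest count.
tally = Counter(x.tolist())
best_count = max(tally.values())
# Smallest value among those reaching the highest count, matching mode1's tie rule.
best_value = min(v for v, c in tally.items() if c == best_count)

print(value == best_value, count == best_count)
```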

Now check the performance.

In [45]: %timeit mode(x)
2.91 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [46]: %timeit mode1(x)
39.6 ms ± 83.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

That is 2.91 seconds for mode(x) versus only 39.6 milliseconds for mode1(x), roughly 70 times faster.
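
On the second part of the question: the AttributeError quoted there shows that the installed SciPy does not expose scipy.stats.median. NumPy provides np.median, which can be used instead. A minimal sketch, with a made-up sample array standing in for the data read from the HDF5 file:

```python
import numpy as np

# Hypothetical stand-in for the array read from the HDF5 file.
my_array = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

# Sorted, the values are 1, 1, 3, 4, 5, so the middle value is 3.0.
median_value = np.median(my_array)
print(median_value)
```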

Warren Weckesser answered Nov 23 '25 00:11
