
Replace list values above and below thresholds with default value in python?

Tags: python, numpy

I am trying to replace "bad values" that fall below a lower threshold or above an upper threshold with a default value (e.g. setting them to NaN). I am using a numpy array with 1000k values and more, so performance is an issue.

My prototype does the operation in two steps. Is there a possibility to do this in one step?

import numpy as np

data = np.array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

upper_threshold = 7
lower_threshold = 1
default_value = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all versions

# is it possible to do this in one expression?
data[data > upper_threshold] = default_value
data[data < lower_threshold] = default_value

print(data) # [ nan   1.   2.   3.   4.   5.   6.   7.  nan  nan]

As noted in a comment on this related question (Pythonic way to replace list values with upper and lower bound (clamping, clipping, thresholding)?):

Like many other functions, np.clip is python, but it defers to arr.clip, the method. For regular arrays that method is compiled, so will be faster (about 2x). – hpaulj

I hope to find a faster way too, thanks in advance!
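For contrast with the related question quoted above, here is a small sketch of what clipping does with the same sample data. It clamps out-of-range values to the nearest bound instead of replacing them with a default, so it does not directly give the NaN result asked for here. The variable names clipped and clipped_fn are only illustrative.

```python
import numpy as np

data = np.array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

# Clipping clamps values outside [1, 7] to the nearest bound;
# it does not substitute a default value such as NaN.
clipped = data.clip(1, 7)         # method form
clipped_fn = np.clip(data, 1, 7)  # function form, same result

print(clipped)
```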

asked Oct 18 '22 19:10 by ppasler


1 Answer

Use boolean-indexing in one go with a combined mask -

data[(data > upper_threshold) | (data < lower_threshold)] = default_value
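As a quick sanity check, the following self-contained sketch applies both the two-step version from the question and the combined-mask expression to copies of the sample array and compares the results. The names original, two_step and one_step are illustrative and not part of the question.

```python
import numpy as np

upper_threshold = 7
lower_threshold = 1
default_value = np.nan

original = np.array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

# Two-step version from the question, applied to a copy.
two_step = original.copy()
two_step[two_step > upper_threshold] = default_value
two_step[two_step < lower_threshold] = default_value

# One expression with a combined mask. The parentheses are required
# because | binds more tightly than the comparison operators.
one_step = original.copy()
one_step[(one_step > upper_threshold) | (one_step < lower_threshold)] = default_value

# equal_nan=True treats NaNs at matching positions as equal.
print(np.allclose(one_step, two_step, equal_nan=True))
```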

Runtime test -

In [109]: def onepass(data, upper_threshold, lower_threshold, default_value):
     ...:     mask = (data > upper_threshold) | (data < lower_threshold)
     ...:     data[mask] = default_value
     ...: 
     ...: def twopass(data, upper_threshold, lower_threshold, default_value):
     ...:     data[data > upper_threshold] = default_value
     ...:     data[data < lower_threshold] = default_value
     ...:     

In [110]: upper_threshold = 7
     ...: lower_threshold = 1
     ...: default_value = np.nan
     ...: 

In [111]: data = np.random.randint(-4,11,(1000000)).astype(float)

In [112]: %timeit twopass(data, upper_threshold, lower_threshold, default_value)
100 loops, best of 3: 2.41 ms per loop

In [113]: data = np.random.randint(-4,11,(1000000)).astype(float)

In [114]: %timeit onepass(data, upper_threshold, lower_threshold, default_value)
100 loops, best of 3: 2.74 ms per loop

The proposed one-pass indexing method does not perform any better here (2.74 ms vs. 2.41 ms for the two-pass version). The reason could be that OR-ing the two masks is a bit more expensive than assigning values directly with two separate boolean-indexing steps.
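One caveat about timing these functions: both modify data in place, so after the first call in a timing loop the out-of-range values are already NaN. Later calls then compare against NaN (which is always False) and assign nothing. Below is a rough sketch of timing each call on a fresh copy instead. The helper name time_on_fresh_copies, the seed and the repeat count are assumptions made for illustration, and no particular outcome is claimed.

```python
import timeit
import numpy as np

def onepass(data, upper_threshold, lower_threshold, default_value):
    mask = (data > upper_threshold) | (data < lower_threshold)
    data[mask] = default_value

def twopass(data, upper_threshold, lower_threshold, default_value):
    data[data > upper_threshold] = default_value
    data[data < lower_threshold] = default_value

# Same kind of test data as above: integers in [-4, 11) stored as floats.
rng = np.random.default_rng(0)
base = rng.integers(-4, 11, size=1_000_000).astype(float)

def time_on_fresh_copies(func, repeats=20):
    """Average seconds per call, with each call working on an unmodified copy."""
    total = 0.0
    for _ in range(repeats):
        work = base.copy()  # the copy is made outside the timed region
        start = timeit.default_timer()
        func(work, 7, 1, np.nan)
        total += timeit.default_timer() - start
    return total / repeats

print("twopass:", time_on_fresh_copies(twopass))
print("onepass:", time_on_fresh_copies(onepass))
```

Results will vary with the machine, the NumPy version and the fraction of values that fall outside the thresholds.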

answered Oct 21 '22 02:10 by Divakar