I have a list of integer numbers and I want to write a function that returns a subset of numbers that are within a range. Something like NumbersWithinRange(list, interval) function name...
I.e.,
list = [4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100]
interval = [4,20]
results = NumbersWithinRange(list, interval) # [4,4,6,8,7,8]
maybe i forgot to write one more number in results, but that's the idea...
The list can be as big as 10/20 million length, and the range is normally of a few 100.
Any suggestions on how to do it efficiently with python - I was thinking to use bisect.
Thanks.
I would use numpy for that, especially if the list is that long. For example:
In [101]: list = np.array([4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100])
In [102]: list
Out[102]:
array([ 4, 2, 1, 7, 9, 4, 3, 6, 8, 97, 7, 65, 3,
2, 2, 78, 23, 1, 3, 4, 5, 67, 8, 100])
In [103]: good = np.where((list > 4) & (list < 20))
In [104]: list[good]
Out[104]: array([7, 9, 6, 8, 7, 5, 8])
# %timeit says that numpy is MUCH faster than any list comprehension:
# create an array 10**6 random ints b/w 0 and 100
In [129]: arr = np.random.randint(0,100,1000000)
In [130]: interval = xrange(4,21)
In [126]: %timeit r = [x for x in arr if x in interval]
1 loops, best of 3: 14.2 s per loop
In [136]: %timeit good = np.where((list > 4) & (list < 20)) ; new_list = list[good]
100 loops, best of 3: 10.8 ms per loop
In [134]: %timeit r = [x for x in arr if 4 < x < 20]
1 loops, best of 3: 2.22 s per loop
In [142]: %timeit filtered = [i for i in ifilter(lambda x: 4 < x < 20, arr)]
1 loops, best of 3: 2.56 s per loop
The pure-Python Python sortedcontainers module has a SortedList type that can help you. It maintains the list automatically in sorted order and has been tested passed tens of millions of elements. The sorted list type has a bisect function you can use.
from sortedcontainers import SortedList
data = SortedList(...)
def NumbersWithinRange(items, lower, upper):
start = items.bisect(lower)
end = items.bisect_right(upper)
return items[start:end]
subset = NumbersWithinRange(data, 4, 20)
Bisecting and indexing will be much faster this way than scanning the entire list. The sorted containers module is very fast and has a performance comparison page with benchmarks against alternative implementations.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With