Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find numbers within a range bisect python

I have a list of integer numbers and I want to write a function that returns a subset of numbers that are within a range. Something like NumbersWithinRange(list, interval) function name...

I.e.,

list = [4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100]
interval = [4,20]
results = NumbersWithinRange(list, interval)  # [4,4,6,8,7,8]

maybe i forgot to write one more number in results, but that's the idea...

The list can be as big as 10/20 million length, and the range is normally of a few 100.

Any suggestions on how to do it efficiently with python - I was thinking to use bisect.

Thanks.

like image 286
Dnaiel Avatar asked Nov 30 '22 05:11

Dnaiel


2 Answers

I would use numpy for that, especially if the list is that long. For example:

In [101]: list = np.array([4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100])
In [102]: list
Out[102]: 
array([  4,   2,   1,   7,   9,   4,   3,   6,   8,  97,   7,  65,   3,
         2,   2,  78,  23,   1,   3,   4,   5,  67,   8, 100])
In [103]: good = np.where((list > 4) & (list < 20)) 
In [104]: list[good]
Out[104]: array([7, 9, 6, 8, 7, 5, 8])

# %timeit says that numpy is MUCH faster than any list comprehension: 
# create an array 10**6 random ints b/w 0 and 100
In [129]: arr = np.random.randint(0,100,1000000)
In [130]: interval = xrange(4,21)
In [126]: %timeit r = [x for x in arr if x in interval]
1 loops, best of 3: 14.2 s per loop

In [136]: %timeit good = np.where((list > 4) & (list < 20)) ; new_list = list[good]
100 loops, best of 3: 10.8 ms per loop

In [134]: %timeit r = [x for x in arr if 4 < x < 20]
1 loops, best of 3: 2.22 s per loop 

In [142]: %timeit filtered = [i for i in ifilter(lambda x: 4 < x < 20, arr)]
1 loops, best of 3: 2.56 s per loop
like image 139
reptilicus Avatar answered Dec 05 '22 02:12

reptilicus


The pure-Python Python sortedcontainers module has a SortedList type that can help you. It maintains the list automatically in sorted order and has been tested passed tens of millions of elements. The sorted list type has a bisect function you can use.

from sortedcontainers import SortedList
data = SortedList(...)

def NumbersWithinRange(items, lower, upper):
    start = items.bisect(lower)
    end = items.bisect_right(upper)
    return items[start:end]

subset = NumbersWithinRange(data, 4, 20)

Bisecting and indexing will be much faster this way than scanning the entire list. The sorted containers module is very fast and has a performance comparison page with benchmarks against alternative implementations.

like image 38
GrantJ Avatar answered Dec 05 '22 03:12

GrantJ