There's this interesting API called Intervalindex
new in 0.20 that lets you create an index of intervals.
Given some sample data:
data = [(893.1516130000001, 903.9187099999999),
(882.384516, 893.1516130000001),
(817.781935, 828.549032)]
You can create the index like this:
idx = pd.IntervalIndex.from_tuples(data)
print(idx)
IntervalIndex([(893.151613, 903.91871], (882.384516, 893.151613], (817.781935, 828.549032]]
closed='right',
dtype='interval[float64]')
An interesting property of Interval
s is that you can perform interval checks with in
:
print(y[-1])
Interval(817.78193499999998, 828.54903200000001, closed='right')
print(820 in y[-1])
True
print(1000 in y[-1])
False
I would like to know how to apply this operation to the entire index. For example, given some number 900
, how could I retrieve a boolean mask of intervals for which this number fits in?
I can think of:
m = [900 in y for y in idx]
print(m)
[True, False, False]
Are there better ways to do this?
If you are interested in performance, an IntervalIndex is optimized for searching. using .get_loc
or .get_indexer
uses an internally built IntervalTree (like a binary tree), which is constructed on first use.
In [29]: idx = pd.IntervalIndex.from_tuples(data*10000)
In [30]: %timeit -n 1 -r 1 idx.map(lambda x: 900 in x)
92.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [40]: %timeit -n 1 -r 1 idx.map(lambda x: 900 in x)
42.7 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# construct tree and search
In [31]: %timeit -n 1 -r 1 idx.get_loc(900)
4.55 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# subsequently
In [32]: %timeit -n 1 -r 1 idx.get_loc(900)
137 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# for a single indexer you can do even better (note that this is
# dipping into the impl a bit
In [27]: %timeit np.arange(len(idx))[(900 > idx.left) & (900 <= idx.right)]
203 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Note that .get_loc() returns an indexer (which is actually more useful than a boolean array, but they are convertible to each other).
In [38]: idx.map(lambda x: 900 in x)
...:
Out[38]:
Index([ True, False, False, True, False, False, True, False, False, True,
...
False, True, False, False, True, False, False, True, False, False], dtype='object', length=30000)
In [39]: idx.get_loc(900)
...:
Out[39]: array([29997, 9987, 10008, ..., 19992, 19989, 0])
Returning a boolean array is converted to an array of indexers
In [5]: np.arange(len(idx))[idx.map(lambda x: 900 in x).values.astype(bool)]
Out[5]: array([ 0, 3, 6, ..., 29991, 29994, 29997])
This is what .get_loc() and .get_indexer() return:
In [6]: np.sort(idx.get_loc(900))
Out[6]: array([ 0, 3, 6, ..., 29991, 29994, 29997])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With