I'm wondering if there's an efficient way to do this in pandas: Given a dataframe, what is the first row that is smaller than a given value? For example, given:
addr
0 4196656
1 4197034
2 4197075
3 4197082
4 4197134
What is the first value that is smaller than 4197080? I want it to return just the row with 4197075. A solution would be to first filter by 4197080 and then take the last row, but that looks like to be an extremely slow O(N) operation (first building a dataframe and then taking its last row), while a binary search would take O(logN).
df.addr[ df.addr < 4197080].tail(1)
I timed it, and creating df.addr[ df.addr < 4197080]
more or less takes the same as df.addr[ df.addr < 4197080].tail(1)
, strongly hinting that internally it's building an entire df first.
num = np.random.randint(0, 10**8, 10**6)
num.sort()
df = pd.DataFrame({'addr':num})
df = df.set_index('addr', drop=False)
df = df.sort_index()
Getting the first smaller value is very slow:
%timeit df.addr[ df.addr < 57830391].tail(1)
100 loops, best of 3: 7.9 ms per loop
Using lt improves things a bit:
%timeit df.lt(57830391)[-1:]
1000 loops, best of 3: 853 µs per loop
But still nowhere near as fast as a binary search:
%timeit bisect(num, 57830391, 0, len(num))
100000 loops, best of 3: 6.53 µs per loop
Is there any better way?
This requires 0.14.0
Note that the frame IS NOT SORTED.
In [16]: s = df['addr']
Find biggest value lower than required
In [18]: %timeit s[s<5783091]
100 loops, best of 3: 9.01 ms per loop
In [19]: %timeit s[s<5783091].nlargest(1)
100 loops, best of 3: 11 ms per loop
So this is faster than actuallying performing a full-sort, then indexing.
The .copy
is to avoid biasing the inplace sort.
In [32]: x = np.random.randint(0, 10**8, 10**6)
In [33]: def f(x):
....: x.copy().sort()
....:
In [35]: %timeit f(x)
10 loops, best of 3: 67.2 ms per loop
If you are simply searching an ALREADY SORTED series, then use searchsorted
. Note that you must use the numpy version (e.g. operate on .values
. The series version will be defined in 0.14.1)
In [41]: %timeit s.values.searchsorted(5783091)
100000 loops, best of 3: 2.5 µs per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With