Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: Efficient way to get first row with element that is smaller than a given value

Tags:

python

pandas

I'm wondering if there's an efficient way to do this in pandas: Given a dataframe, what is the first row that is smaller than a given value? For example, given:

      addr
0  4196656
1  4197034
2  4197075
3  4197082
4  4197134

What is the first value that is smaller than 4197080? I want it to return just the row with 4197075. A solution would be to first filter by 4197080 and then take the last row, but that looks like to be an extremely slow O(N) operation (first building a dataframe and then taking its last row), while a binary search would take O(logN).

df.addr[ df.addr < 4197080].tail(1)

I timed it, and creating df.addr[ df.addr < 4197080] more or less takes the same as df.addr[ df.addr < 4197080].tail(1), strongly hinting that internally it's building an entire df first.

num = np.random.randint(0, 10**8, 10**6)
num.sort()
df = pd.DataFrame({'addr':num})
df = df.set_index('addr', drop=False)
df = df.sort_index()

Getting the first smaller value is very slow:

%timeit df.addr[ df.addr < 57830391].tail(1)
100 loops, best of 3: 7.9 ms per loop

Using lt improves things a bit:

%timeit df.lt(57830391)[-1:]
1000 loops, best of 3: 853 µs per loop

But still nowhere near as fast as a binary search:

%timeit bisect(num, 57830391, 0, len(num))
100000 loops, best of 3: 6.53 µs per loop

Is there any better way?

like image 568
Georgios Bitzes Avatar asked Jun 17 '14 12:06

Georgios Bitzes


1 Answers

This requires 0.14.0

Note that the frame IS NOT SORTED.

In [16]: s = df['addr']

Find biggest value lower than required

In [18]: %timeit s[s<5783091]
100 loops, best of 3: 9.01 ms per loop

In [19]: %timeit s[s<5783091].nlargest(1)
100 loops, best of 3: 11 ms per loop

So this is faster than actuallying performing a full-sort, then indexing. The .copy is to avoid biasing the inplace sort.

In [32]: x = np.random.randint(0, 10**8, 10**6)

In [33]: def f(x):
   ....:     x.copy().sort()
   ....:     

In [35]: %timeit f(x)
10 loops, best of 3: 67.2 ms per loop

If you are simply searching an ALREADY SORTED series, then use searchsorted. Note that you must use the numpy version (e.g. operate on .values. The series version will be defined in 0.14.1)

In [41]: %timeit  s.values.searchsorted(5783091)
100000 loops, best of 3: 2.5 µs per loop
like image 194
Jeff Avatar answered Sep 21 '22 14:09

Jeff