Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performance of numpy.searchsorted is poor on structured arrays

Sorry in advance if I'm misusing any terms, feel free to correct that.

I have a sorted array with dtype '<f16, |S30'. When I use searchsorted on its first field, it works really slow (about 0.4 seconds for 3 million items). That is much longer than bisect takes to do the same on a plain Python list of tuples.

%timeit a['f0'].searchsorted(400.)
1 loops, best of 3: 398 ms per loop

However, if I copy the float part to another, separate array, the search is faster than bisect:

b = a['f0'].copy()

%timeit b.searchsorted(400.)
1000000 loops, best of 3: 945 ns per loop

My questions are:

  1. Am I doing something wrong or is it a regression in NumPy?
  2. Is there a way to circumvent this without duplication of the data?
like image 598
Lev Levitsky Avatar asked Feb 28 '13 15:02

Lev Levitsky


2 Answers

I remember seeing this some time ago. If I remember correctly, I think searchsorted makes a temporary copy of the data when the data is not contiguous. If I have time later, I'll take a look at the code to confirm that's what's happening (or maybe someone more familiar with the code can confirm this).

In the mean time, if you don't want to restructure your code to avoid using a structured array, your best bet is probably to use bisect_left(a['f0'], 400.). On my machine it's 8x slower than searchsorted on a contiguous array but 1000x faster than searchsorted on a non-contiguous array.

In [5]: a = np.arange((6e6)).view([('f0', float), ('f1', float)])

In [6]: timeit a['f0'].searchsorted(400.)
10 loops, best of 3: 51.1 ms per loop

In [7]: timeit a['f0'].copy()
10 loops, best of 3: 51 ms per loop

In [8]: timeit bisect_left(a['f0'], 400.)
10000 loops, best of 3: 52.8 us per loop

In [9]: f0 = a['f0'].copy()

In [10]: timeit f0.searchsorted(400.)
100000 loops, best of 3: 7.85 us per loop
like image 131
Bi Rico Avatar answered Oct 09 '22 17:10

Bi Rico


Here is some code to illustrate the size of problem (as of 11th May 2015) and how to 'fix' it.

import numpy as np
import bisect
import timeit
from random import randint

dtype = np.dtype([ ('pos','<I'),('sig','<H') ])             # my data is unsigned 32bit, and unsigned 16bit
data1 = np.fromfile('./all2/840d.0a9b45e8c5344abf6ac761017e93b5bb.2.1bp.binary', dtype)

dtype2 = np.dtype([('pos',np.uint32),('sig',np.uint32)])    # convert data to both unsigned 32bit
data2 = data1.astype(dtype2)

data3 = data2.view(('uint32', len(data2.dtype.names)))    # convert above to a normal array (not structured array)

print data1.dtype.descr # [('pos', '<u4'), ('sig', '<u2')]
print data2.dtype.descr # [('pos', '<u4'), ('sig', '<u4')]
print data3.dtype.descr # [('', '<u4')]

print data1.nbytes  # 50344494
print data2.nbytes  # 67125992
print data3.nbytes  # 67125992

print data1['pos'].max() # 2099257376
print data2['pos'].max() # 2099257376
print data3[:,0].max()   # 2099257376

def b1():   return bisect.bisect_left(data1['pos'],           randint(100000000,200000000))
def b2():   return bisect.bisect_left(data2['pos'],           randint(100000000,200000000))
def b3():   return bisect.bisect_left(data3[:,0],             randint(100000000,200000000))
def ss1():  return np.searchsorted(data1['pos'],              randint(100000000,200000000))
def ss2():  return np.searchsorted(data2['pos'],              randint(100000000,200000000))
def ss3():  return np.searchsorted(data3[:,0],                randint(100000000,200000000))

def ricob1():   return bisect.bisect_left(data1['pos'], np.uint32(randint(100000000,200000000)))
def ricob2():   return bisect.bisect_left(data2['pos'], np.uint32(randint(100000000,200000000)))
def ricob3():   return bisect.bisect_left(data3[:,0],   np.uint32(randint(100000000,200000000)))
def ricoss1():  return np.searchsorted(data1['pos'],    np.uint32(randint(100000000,200000000)))
def ricoss2():  return np.searchsorted(data2['pos'],    np.uint32(randint(100000000,200000000)))
def ricoss3():  return np.searchsorted(data3[:,0],      np.uint32(randint(100000000,200000000)))

print timeit.timeit(b1,number=300)  # 0.0085117816925
print timeit.timeit(b2,number=300)  # 0.00826191902161
print timeit.timeit(b3,number=300)  # 0.00828003883362
print timeit.timeit(ss1,number=300) # 6.57477498055
print timeit.timeit(ss2,number=300) # 5.95308589935
print timeit.timeit(ss3,number=300) # 5.92483091354

print timeit.timeit(ricob1,number=300)  # 0.00120902061462
print timeit.timeit(ricob2,number=300)  # 0.00120401382446
print timeit.timeit(ricob3,number=300)  # 0.00120711326599
print timeit.timeit(ricoss1,number=300) # 4.39265394211
print timeit.timeit(ricoss2,number=300) # 0.00116586685181
print timeit.timeit(ricoss3,number=300) # 0.00108909606934

Update! So thanks to Rico's comments, it seems like setting the type for the number you want to searchsorted/bisect is really import! However, on the structured array with 32bit and 16bit ints, its still slow (although no where near as slow as before)

like image 23
J.J Avatar answered Oct 09 '22 17:10

J.J