Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is numpy much slower than matlab on a digitize example?

I am comparing performance of numpy vs matlab, in several cases I observed that numpy is significantly slower (indexing, simple operations on arrays such as absolute value, multiplication, sum, etc.). Let's look at the following example, which is somehow striking, involving the function digitize (which I plan to use for synchronizing timestamps):

import numpy as np
import time
scale=np.arange(1,1e+6+1)
y=np.arange(1,1e+6+1,10)
t1=time.time()
ind=np.digitize(scale,y)
t2=time.time()
print 'Time passed is %2.2f seconds' %(t2-t1)

The result is:

Time passed is 55.91 seconds

Let's now try the same example Matlab using the equivalent function histc

scale=[1:1e+6];
y=[1:10:1e+6];
tic
[N,bin]=histc(scale,y);
t=toc;
display(['Time passed is ',num2str(t), ' seconds'])

The result is:

Time passed is 0.10237 seconds

That's 560 times faster!

As I'm learning to extend Python with C++, I implemented my own version of digitize (using boost libraries for the extension):

import analysis # my C++ module implementing digitize
t1=time.time()
ind2=analysis.digitize(scale,y)
t2=time.time()
print 'Time passed is %2.2f seconds' %(t2-t1)
np.all(ind==ind2) #ok

The result is:

Time passed is 0.02 seconds

There is a bit of cheating as my version of digitize assumes inputs are all monotonic, this might explain why it is even faster than Matlab. However, sorting an array of size 1e+6 takes 0.16 seconds (with numpy.sort), making therefore the performance of my function worse (by a factor of approx 1.6) compared to the Matlab function histc.

So the questions are:

  • Why is numpy.digitize so slow? Is this function not supposed to be written in compiled and optimized code?
  • Why is my own version of digitize much faster than numpy.digitize, but still slower than Matlab (I am quite confident I use the fastest algorithm possible, given that I assume inputs are already sorted)?

I am using Fedora 16 and I recently installed ATLAS and LAPACK libraries (but there has been so change in performance). Should I perhaps rebuild numpy? I am not sure if my installation of numpy uses the appropriate libraries to gain maximum speed, perhaps Matlab is using better libraries.

Update

Based on the answers so far, I would like to stress that the Matlab function histc is not equivalent to numpy.histogram if someone (like me in this case) does not care about the histogram. I need the second output of hisc, which is a mapping from input values to the index of the provided input bins. Such an output is provided by the numpy functions digitize and searchsorted. As one of the answers says, searchsorted is much faster than digitize. However, searchsorted is still slower than Matlab by a factor 2:

t1=time.time()
ind3=np.searchsorted(y,scale,"right")
t2=time.time()
print 'Time passed is %2.2f seconds' %(t2-t1)

np.all(ind==ind3) #ok

The result is

Time passed is 0.21 seconds

So the questions are now:

  1. What is the sense of having numpy.digitize if there is an equivalent function numpy.searchsorted which is 280 times faster?

  2. Why is the Matlab function histc (which also provides the output of numpy.searchsorted) 2 times faster than numpy.searchsorted?

like image 562
Mannaggia Avatar asked Feb 25 '12 13:02

Mannaggia


2 Answers

First, let's look at why numpy.digitize is slow. If your bins are found to be monotonic, then one of these functions is called depending on whether the bins are nondecreasing or nonincreasing (the code for this is found in numpy/lib/src/_compiled_base.c in the numpy git repo):

static npy_intp
incr_slot_(double x, double *bins, npy_intp lbins)
{
    npy_intp i;

    for ( i = 0; i < lbins; i ++ ) {
        if ( x < bins [i] ) {
            return i;
        }
    }
    return lbins;
}

static npy_intp
decr_slot_(double x, double * bins, npy_intp lbins)
{
    npy_intp i;

    for ( i = lbins - 1; i >= 0; i -- ) {
        if (x < bins [i]) {
            return i + 1;
        }
    }
    return 0;
}

As you can see, it is doing a linear search. Linear search is much, much slower than binary search so there is your answer as to why it is slow. I will open a ticket for this on the numpy tracker.

Second, I think that Matlab is actually slower than your C++ code because Matlab also assumes that the bins are monotonically nondecreasing.

like image 86
Justin Peel Avatar answered Nov 15 '22 14:11

Justin Peel


I can't answer why numpy.digitize() is so slow -- I could confirm your timings on my machine.

The function numpy.searchsorted() does basically the same thing as numpy.digitize(), but efficiently.

ind = np.searchsorted(y, scale, "right")

takes about 0.15 seconds on my machine and gives exactly the same result as your code.

Note that your Matlab code does something different from both of those functions -- it is the equivalent of numpy.histogram().

like image 44
Sven Marnach Avatar answered Nov 15 '22 15:11

Sven Marnach