Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast searching for the lowest value greater than x in a sorted vector

Fast means better than O(N), which is as good as find() is capable of. I know there is ismembc and ismembc2, but I don't think either of them are what I am looking for. I read the documentation and it seems they search for a member equal to x, but I want the index of first value greater than x.

Now if either of these functions is capable of doing this, could somebody please give an example, because I can't figure it out.

Ideal behaviour:

first_greater_than([0, 3, 3, 4, 7], 1)

returns 2, the index of the first value greater than 1, though obviously the input array would be vastly larger.

Of course, a binary search isn't too difficult to implement, but if MATLAB has already done it, I would rather use their method.

like image 795
enigmaticPhysicist Avatar asked Dec 09 '13 02:12

enigmaticPhysicist


2 Answers

Since the input is already sorted a custom binary search should work (you may need to do some updates for edge cases, i.e value requested is less than all elements of the array):

function [result, res2] = binarySearchExample(val) 

    %// Generate example data and sort it
    N = 100000000;
    a = rand(N, 1);
    a = sort(a);

    %// Run the algorithm
    tic % start timing of the binary search algorithm
    div = 1;
    idx = floor(N/div);
    while(1)
        div = div * 2;

        %// Check if less than val check if the next is greater
        if a(idx) <= val,
            if a(idx + 1) > val,
                result = a(idx + 1);
                break
            else %// Get bigger 
                idx = idx + max([floor(N / div), 1]);
            end
        end
        if a(idx) > val, % get smaller
            idx = idx - floor(N / div);
        end
    end % end the while loop
    toc % end timing of the binary search algorithm

    %% ------------------------
    %% compare to MATLAB find
    tic % start timing of a matlab find
    j = find(a > val, 1);
    res2 = a(j);
    toc % end timing of a matlab find

%// Benchmark
>> [res1, res2] = binarySearchExample(0.556)

Elapsed time is 0.000093 seconds.
Elapsed time is 0.327183 seconds.

res1 =
   0.5560

res2 =
   0.5560
like image 163
ljk07 Avatar answered Nov 15 '22 09:11

ljk07


Here's my implementation. This is not quite the answer I was looking for, but for now, I will have to assume what I am after is not implemented in MATLAB.

A Note on Indices

All MATLAB indices are done wrong, in that they start from 1 and not 0. I still index from 0, though, in spite of this. So throughout, you will see indexing that looks like this: array(1+i) accesses element i, where i is in [0, N). Also, all MATLAB ranges are done wrong. Their convention is [a, b], instead of [a, b). So you will see ranges that look like this throughout: 0:N-1 is the range of numbers (often indices of an N-dimensional array) from 0 to N. When arrays are indexed with a range, both corrections have to be made at the same time. 1 is added to the top and bottom boundary and 1 is subtracted from the top. This is the result: array(1+a:b) accesses elements in [a, b), where a and b are in [0, N) and b > a. I should really just be using python and scipy instead, but it's too late for that. Next project...

binary_search.m: It is much tidier than @ljk07's implementation in my opinion, but they still get an accept, of course. Thanks, @ljk07.

function i = binary_search(v, x)
%binary_search finds the first element in v greater than x
% v is a vector and x is a double. Returns the index of the desired element
% as an int64 or -1 if it doesn't exist.

% We'll call the first element of v greater than x v_f.

% Is v_f the zeroth element? This is technically covered by the algorithm,
% but is such a common case that it should be addressed immediately. It
% would otherwise take the same amount of time as the rest of them. This
% will add a check to each of the others, though, so it's a toss-up to an
% extent.
if v(1+0) > x
    i = 0;
    return;
end

% MATLAB foolishly returns the number of elements as a floating point
% constant. Thank you very much, MATLAB.
b = int64(numel(v));

% If v_f doesn't exist, return -1. This is also needed to ensure the
% algorithm later on terminates, which makes sense.
if v(1+b-1) <= x
    i = -1;
    return;
end

a = int64(0);

% There is now guaranteed to be more than one element, since if there
% wasn't, one of the above would have matched. So we split the [a, b) range
% at the top of the loop.

% The number of elements in the interval. Calculated once per loop. It is
% recalculated at the bottom of the loop, so it needs to be calculated just
% once before the loop can begin.
n = b;
while true
    % MATLAB's / operator foolishly rounds to nearest instead of flooring
    % when both inputs are integers. Thank you very much, MATLAB.
    p = a + idivide(n, int64(2));

    % Is v_f in [a, p) or [p, b)?
    if v(1+p-1) > x
        % v_f is in [a, p).
        b = p;
    else
        % v_f is in [p, b).
        a = p;
    end

    n = b - a;
    if n == 1
        i = a;
        return;
    end
end
end

binary_search_test.m:

% Some simple tests. These had better pass...
assert(binary_search([0], 0) == -1);
assert(binary_search([0], -1) == 0);

assert(binary_search([0 1], 0.5) == 1);
assert(binary_search([0 1 1], 0.5) == 1);
assert(binary_search([0 1 2], 0.5) == 1);
assert(binary_search([0 1 2], 1.5) == 2);

% Compare the algorithm to internal find.
for n = [1 1:8]
    n
    v = sort(rand(10^n, 1));
    x = 0.5;
    %%
    tic;
    ifind = find(v > x, 1,'first') - 1;
    toc;
    % repeat. The second time is faster usually. Some kind of JIT
    % optimisation...
    tic;
    ifind = find(v > x, 1,'first') - 1;
    toc;
    tic;
    ibs = binary_search(v, x);
    toc;
    tic;
    ibs = binary_search(v, x);
    toc;
    assert(ifind == ibs);
end

Output of binary_search_test.m (on my computer):

n =

     1

Elapsed time is 0.000054 seconds.
Elapsed time is 0.000021 seconds.
Elapsed time is 0.001273 seconds.
Elapsed time is 0.001135 seconds.

n =

     2

Elapsed time is 0.000050 seconds.
Elapsed time is 0.000018 seconds.
Elapsed time is 0.001571 seconds.
Elapsed time is 0.001494 seconds.

n =

     3

Elapsed time is 0.000034 seconds.
Elapsed time is 0.000025 seconds.
Elapsed time is 0.002344 seconds.
Elapsed time is 0.002193 seconds.

n =

     4

Elapsed time is 0.000057 seconds.
Elapsed time is 0.000044 seconds.
Elapsed time is 0.003131 seconds.
Elapsed time is 0.003031 seconds.

n =

     5

Elapsed time is 0.000473 seconds.
Elapsed time is 0.000333 seconds.
Elapsed time is 0.003620 seconds.
Elapsed time is 0.003161 seconds.

n =

     6

Elapsed time is 0.003984 seconds.
Elapsed time is 0.003635 seconds.
Elapsed time is 0.004209 seconds.
Elapsed time is 0.003825 seconds.

n =

     7

Elapsed time is 0.034811 seconds.
Elapsed time is 0.039106 seconds.
Elapsed time is 0.005089 seconds.
Elapsed time is 0.004867 seconds.

n =

     8

Elapsed time is 0.322853 seconds.
Elapsed time is 0.323777 seconds.
Elapsed time is 0.005969 seconds.
Elapsed time is 0.005487 seconds.

There is definitely a speedup. On my computer you can see the speedup is reached in around the one million element mark. So unless binary_search is implemented in C or you have a vector with about million elements, find is still faster, even though it is using a stupid algorithm. I was expecting the threshold to be waaay lower than that. My guess is because find is mostly internally implemented in C. Not fair :( But nevertheless, for my particular application, I have vector sizes of only about a thousand, so after all that, find really is faster for me. At least until the day I implement binary_search in C with a mex file or switch to scipy, whichever happens first. I'm kind of getting tired of MATLAB's little inconvenient switchups. You can tell by reading the comments in my code.

like image 20
enigmaticPhysicist Avatar answered Nov 15 '22 07:11

enigmaticPhysicist