Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to find out minimum of 3 numbers?

Tags:

In a program I wrote, 20% of the time is being spent on finding out the minimum of 3 numbers in an inner loop, in this routine:

static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int m = a;
    if (m > b) m = b;
    if (m > c) m = c;
    return m;
}

Is there any way to speed this up? I am ok with assembly code too for x86/x86_64.

Edit: In reply to some of the comments:
* Compiler being used is gcc 4.3.3
* As far as assembly is concerned, I am only a beginner there. I asked for assembly here, to learn how to do this. :)
* I have a quad-core Intel 64 running, so MMX/SSE etc. are supported.
* It's hard to post the loop here, but I can tell you it's a heavily optimized implementation of the levenshtein algorithm.

This is what the compiler is giving me for the non-inlined version of min:

.globl min
    .type   min, @function
min:
    pushl   %ebp
    movl    %esp, %ebp
    movl    8(%ebp), %edx
    movl    12(%ebp), %eax
    movl    16(%ebp), %ecx
    cmpl    %edx, %eax
    jbe .L2
    movl    %edx, %eax
.L2:
    cmpl    %ecx, %eax
    jbe .L3
    movl    %ecx, %eax
.L3:
    popl    %ebp
    ret
    .size   min, .-min
    .ident  "GCC: (Ubuntu 4.3.3-5ubuntu4) 4.3.3"
    .section    .note.GNU-stack,"",@progbits

The inlined version is within -O2 optimized code (even my markers mrk = 0xfefefefe, before and after the call to min()) are getting optimized away by gcc, so I couldn't get hold of it.

Update: I tested the changes suggested by Nils, ephemient, however there's no perceptible performance boost I get by using the assembly versions of min(). However, I get a 12.5% boost by compiling the program with -march=i686, which I guess is because the whole program is getting the benefits of the new faster instructions that gcc is generating with this option. Thanks for your help guys.

P.S. - I used the ruby profiler to measure performance (my C program is a shared library loaded by a ruby program), so I could get time spent only for the top-level C function called by the ruby program, which ends up calling min() down the stack. Please see this question.

like image 250
Sudhanshu Avatar asked Jan 11 '10 03:01

Sudhanshu


People also ask

How do you find the minimum of three numbers?

Get three inputs num1, num2 and num3 from user using scanf statements. Check whether num1 is smaller than num2 and num3 using if statement, if it is true print num1 is smallest using printf statement. Else, num2 or num3 is smallest. So check whether num2 is smaller than num3 using elseif statement.

Can math min take 3 arguments?

You will find that Math. min() has 4 implementations, each only hold 2 arguments. If you need more arguments you will have to do your own.


2 Answers

Make sure you are using an appropriate -march setting, first off. GCC defaults to not using any instructions that were not supported on the original i386 - allowing it to use newer instruction sets can make a BIG difference at times! On -march=core2 -O2 I get:

min:
    pushl   %ebp
    movl    %esp, %ebp
    movl    8(%ebp), %edx
    movl    12(%ebp), %ecx
    movl    16(%ebp), %eax
    cmpl    %edx, %ecx
    leave
    cmovbe  %ecx, %edx
    cmpl    %eax, %edx
    cmovbe  %edx, %eax
    ret

The use of cmov here may help you avoid branch delays - and you get it without any inline asm just by passing in -march. When inlined into a larger function this is likely to be even more efficient, possibly just four assembly operations. If you need something faster than this, see if you can get the SSE vector operations to work in the context of your overall algorithm.

like image 101
bdonlan Avatar answered Nov 18 '22 09:11

bdonlan


Assuming your compiler isn't out to lunch, this should compile down to two compares and two conditional moves. It isn't possible to do much better than that.

If you post the assembly that your compiler is actually generating, we can see if there's anything unnecessary that's slowing it down.

The number one thing to check is that the routine is actually getting inlined. The compiler isn't obligated to do so, and if it's generating a function call, that will be hugely expensive for such a simple operation.

If the call really is getting inlined, then loop unrolling may be beneficial, as DigitalRoss said, or vectorization may be possible.

Edit: If you want to vectorize the code, and are using a recent x86 processor, you will want to use the SSE4.1 pminud instruction (intrinsic: _mm_min_epu32), which takes two vectors of four unsigned ints each, and produces a vector of four unsigned ints. Each element of the result is the minimum of the corresponding elements in the two inputs.

I also note that your compiler used branches instead of conditional moves; you should probably try a version that uses conditional moves first and see if that gets you any speedup before you go off to the races on a vector implementation.

like image 44
Stephen Canon Avatar answered Nov 18 '22 08:11

Stephen Canon