Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

why comparing char variables twice is faster than comparing short variables once

I have thought one compare must be faster than two. But after my test, I found in debug mode short compare is a bit faster, and in release mode char compare is faster. And I want to know the true reason.

Following is the test code and test result. I wrote two simple functions, func1() using two char compares, and func2() using one short compare. The main function returns temporary return value to avoid compile optimization ignoring my test code. My compiler is GCC 4.7.2, CPU Intel® Xeon® CPU E5-2430 0 @ 2.20GHz (VM).

inline int func1(unsigned char word[2])
{
        if (word[0] == 0xff && word[1] == 0xff)
                return 1;
        return 0;
}

inline int func2(unsigned char word[2])
{
        if (*(unsigned short*)word == 0xffff)
                return 1;
        return 0;
}

int main()
{
        int n_ret = 0;
        for (int j = 0; j < 10000; ++j)
                for (int i = 0; i < 70000; ++i)
                        n_ret += func2((unsigned char*)&i);
        return n_ret;
}

Debug mode:

          func1      func2
real    0m3.621s    0m3.586s
user    0m3.614s    0m3.579s
sys     0m0.001s    0m0.000s

Release mode:

          func1      func2
real    0m0.833s    0m0.880s
user    0m0.831s    0m0.878s
sys     0m0.000s    0m0.002s

func1 edition's assembly code:

        .cfi_startproc
        movl    $10000, %esi
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L6:
        movl    $1, %edx
        xorl    %ecx, %ecx
        .p2align 4,,10
        .p2align 3
.L8:
        movl    %edx, -24(%rsp)
        addl    $1, %edx
        addl    %ecx, %eax
        cmpl    $70001, %edx
        je      .L3
        xorl    %ecx, %ecx
        cmpb    $-1, -24(%rsp)
        jne     .L8
        xorl    %ecx, %ecx
        cmpb    $-1, -23(%rsp)
        sete    %cl
        jmp     .L8
        .p2align 4,,10
        .p2align 3
.L3:
        subl    $1, %esi
        jne     .L6
        rep
        ret
        .cfi_endproc

func2 edition's assembly code:

        .cfi_startproc
        movl    $10000, %esi
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L4:
        movl    $1, %edx
        xorl    %ecx, %ecx
        jmp     .L3
        .p2align 4,,10
        .p2align 3
.L7:
        movzwl  -24(%rsp), %ecx
.L3:
        cmpw    $-1, %cx
        movl    %edx, -24(%rsp)
        sete    %cl
        addl    $1, %edx
        movzbl  %cl, %ecx
        addl    %ecx, %eax
        cmpl    $70001, %edx
        jne     .L7
        subl    $1, %esi
        jne     .L4
        rep
        ret
        .cfi_endproc
like image 631
magicyang Avatar asked Oct 21 '22 18:10

magicyang


1 Answers

In GCC 4.6.3 the code is different for the first and second pieces of code, and the runtime for the func1 option is noticeably slower if you run it for long enough. Unfortunately, with your very short runtime, the two appear similar in time.

Increasing the outer loop by a factor of 10 means it takes about 6 seconds for func2, and 10 seconds for func1. This s using gcc -std=c99 -O3 to compile the code.

The main difference, I expect, is from the extra branch introduced with the && statement. And the extra xorl %ecx, %ecx doesn't help much (I get the same, although my code looks subtly different when it comes to label names).

Edit: I did try to come up with a branchless solution using and instead of a branch, but the compile refuses to inline the function, so it takes 30 seconds instead of 10.

Benchmarks run on:

AMD Phenom(tm) II X4 965

Runs at 3.4 GHz.

like image 95
Mats Petersson Avatar answered Oct 24 '22 13:10

Mats Petersson