 

What's faster on ARM? MUL or (SHIFT + SUB)?

Which is faster on ARM?

*p++ = (*p >> 7) * 255;

or

*p++ = ((*p >> 7) << 8) - 1;

Essentially, what I'm doing here is taking an 8-bit value and setting it to 255 if it is >= 128, and to 0 otherwise.

user1054922 asked Feb 04 '26 17:02


2 Answers

If p points to char, the statement below is just an assignment of 255:

*p++ = ((*p >> 7) << 8) - 1

If p is int, then of course it is a different story.
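A quick way to convince yourself of the char case is to check every 8-bit input. The sketch below assumes p points to unsigned 8-bit data; the function names `shift_sub_byte` and `always_255` are hypothetical and exist only for illustration.

```c
#include <stdint.h>

/* The expression from the question, with the result truncated to 8 bits
   the same way a store through an unsigned char pointer would truncate it. */
static uint8_t shift_sub_byte(uint8_t x)
{
    /* x >> 7 is 0 or 1, so (x >> 7) << 8 is 0 or 256.
       Subtracting 1 gives -1 or 255, and both truncate to 0xFF. */
    return (uint8_t)(((x >> 7) << 8) - 1);
}

/* Returns 1 if every possible input byte maps to 255, 0 otherwise. */
static int always_255(void)
{
    for (int x = 0; x < 256; ++x) {
        if (shift_sub_byte((uint8_t)x) != 255)
            return 0;
    }
    return 1;
}
```

Calling `always_255()` should return 1, which matches the constant store seen for `test2` in the listing that follows.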

You can use GCC Explorer to see what the assembly output looks like. Below is apparently what you get from Linaro's arm-linux-gnueabi-g++ 4.6.3 with the -O2 -march=armv7-a flags:

void test(char *p) {
  *p++ = (*p >> 7) * 255;
}

void test2(char *p) {
  *p++ = ((*p >> 7) << 8) - 1 ;
}

void test2_i(int *p) {
  *p++ = ((*p >> 7) << 8) - 1 ;
}

void test3(char *p) {
  *p++ = *p >= 128 ? ~0 : 0;
}

void test4(char *p) {
  *p++ = *p & 0x80 ? ~0 : 0; 
}

produces

test(char*):
    ldrb    r3, [r0, #0]    @ zero_extendqisi2
    sbfx    r3, r3, #7, #1
    strb    r3, [r0, #0]
    bx  lr
test2(char*):
    movs    r3, #255
    strb    r3, [r0, #0]
    bx  lr
test2_i(int*):
    ldr r3, [r0, #0]
    asrs    r3, r3, #7
    lsls    r3, r3, #8
    subs    r3, r3, #1
    str r3, [r0, #0]
    bx  lr
test3(char*):
    ldrsb   r3, [r0, #0]
    cmp r3, #0
    ite lt
    movlt   r3, #255
    movge   r3, #0
    strb    r3, [r0, #0]
    bx  lr
test4(char*):
    ldrsb   r3, [r0, #0]
    cmp r3, #0
    ite lt
    movlt   r3, #255
    movge   r3, #0
    strb    r3, [r0, #0]
    bx  lr
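The `sbfx r3, r3, #7, #1` in `test` extracts bit 7 as a one-bit signed field, which yields 0 or -1; storing the low byte then gives 0 or 255. A rough C equivalent is sketched below. The name `msb_mask` is hypothetical, and there is no guarantee that a given compiler emits `sbfx` for it.

```c
#include <stdint.h>

/* Replicate the most significant bit of a byte across all eight bits. */
static uint8_t msb_mask(uint8_t x)
{
    /* x >> 7 is 0 or 1; negating gives 0 or -1,
       whose low byte is 0x00 or 0xFF. */
    return (uint8_t)(-(x >> 7));
}
```

This is the branch-free counterpart of the conditional form used in `test3` and `test4`.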

If you care about details like these, the best approach is always to check the assembly the compiler generates. People tend to overestimate compilers. I agree they do a great job most of the time, but anyone is entitled to question the generated code.

You should also be careful when interpreting instruction listings, since they do not always map onto cycle-accurate timing because of architectural features of the core, such as out-of-order execution and deep superscalar pipelines. So the shortest sequence of instructions does not always win.

auselen answered Feb 06 '26 06:02


Well, to answer the question in your title: on ARM, a SHIFT+SUB can be done in a single instruction with 1-cycle latency, while a MUL usually has multi-cycle latency. So the shift will usually be faster.
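The single-instruction form relies on ARM's shifted second operand: in ARM state, an instruction such as `rsb r0, r1, r1, lsl #8` computes (r1 << 8) - r1, which equals r1 * 255. The C sketch below only checks that arithmetic identity. The function names are hypothetical, and which instruction a compiler actually picks depends on the core and the flags.

```c
#include <stdint.h>

/* Multiply by 255 directly. */
static uint32_t mul_255(uint32_t x)
{
    return x * 255u;
}

/* Same value as a shift and a subtract: x * 256 - x. */
static uint32_t shift_sub_255(uint32_t x)
{
    return (x << 8) - x;
}

/* Returns 1 if both forms agree for every 8-bit input, 0 otherwise. */
static int forms_agree_for_bytes(void)
{
    for (uint32_t x = 0; x < 256; ++x) {
        if (mul_255(x) != shift_sub_255(x))
            return 0;
    }
    return 1;
}
```

Because unsigned arithmetic wraps, the two forms agree for every `uint32_t` input, not only for bytes.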

To answer the implied question of what C code to write for this, generally you are best off with the simplest code that expresses your intent:

*p++ = *p >= 128 ? ~0 : 0;  // set byte to all ones iff >= 128

or

*p++ = *p & 0x80 ? ~0 : 0;  // set byte to all ones based on the MSB

This will generally get converted by the compiler into the fastest way of doing it, whether that is a shift-based sequence or a conditional move.
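Applied to a whole buffer, that advice might look like the sketch below. The function name `threshold_bytes` and the use of `uint8_t` are assumptions made for illustration. It reads and writes through an index rather than `*p++ = ... *p ...`, because in C (and in C++ before C++17) that form modifies `p` and reads it in the same expression without sequencing, which is undefined behavior.

```c
#include <stddef.h>
#include <stdint.h>

/* Set each byte to 0xFF if its most significant bit is set, else to 0x00. */
static void threshold_bytes(uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* Read and write the same element; buf itself is never modified. */
        buf[i] = (buf[i] & 0x80) ? 0xFF : 0x00;
    }
}
```

For example, a buffer holding 0, 127, 128, 255 becomes 0, 0, 255, 255 after `threshold_bytes(buf, 4)`.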

Chris Dodd answered Feb 06 '26 06:02