
Maximum optimization of element-wise multiplication via ARM NEON assembly

I'm optimizing an element-wise multiplication of two one-dimensional arrays for a dual-core Cortex-A9 processor. Linux is running on the board, and I'm using the GCC 4.5.2 compiler.

The following is my C++ inline-assembly function; src1, src2 and dst are 16-byte aligned.

Update: Testable Code:

#include <cstdlib>   // rand(), posix_memalign(), free()
#include <iostream>  // std::cout

void Multiply(
    const float* __restrict__ src1,
    const float* __restrict__ src2,
    float* __restrict__ dst,
    const unsigned int width,
    const unsigned int height)
{
    int loopBound = (width * height) / 4;
    asm volatile(
        ".loop:                             \n\t"
        "vld1.32  {q1}, [%[src1]:128]!      \n\t"
        "vld1.32  {q2}, [%[src2]:128]!      \n\t"
        "vmul.f32 q0, q1, q2                \n\t"
        "vst1.32  {q0}, [%[dst]:128]!       \n\t"
        "subs     %[lBound], %[lBound], $1  \n\t"
        "bge      .loop                     \n\t"
        :[dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2),
         [lBound] "+r" (loopBound)
        :
        :"memory", "d0", "d1", "d2", "d3", "d4", "d5", "cc"
    );
}

// The following function shows how to test the element-wise multiplication.
// new[] does not guarantee 16-byte alignment, so posix_memalign is used to
// match the :128 alignment qualifiers in the assembly.
void Test()
{
    const unsigned int width = 1024, height = 1024;
    const size_t bytes = width * height * sizeof(float);
    float* src1 = NULL;
    float* src2 = NULL;
    float* dst = NULL;
    if (posix_memalign(reinterpret_cast<void**>(&src1), 16, bytes) != 0 ||
        posix_memalign(reinterpret_cast<void**>(&src2), 16, bytes) != 0 ||
        posix_memalign(reinterpret_cast<void**>(&dst), 16, bytes) != 0)
    {
        return;
    }
    for(unsigned int i = 0; i < (width * height); i++)
    {
        src1[i] = (float)rand();
        src2[i] = (float)rand();
    }
    Multiply(src1, src2, dst, width, height);

    std::cout << dst[0] << std::endl;

    free(src1);
    free(src2);
    free(dst);
}
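
For completeness, I build the test roughly like this (just a sketch: the file name is made up and it assumes a main() that calls Test(); depending on your toolchain, -mfloat-abi may need to be hard instead of softfp):

    g++ -O2 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp multiply_test.cpp -o multiply_test
    ./multiply_test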

The calculation of 1024*1024 values takes ~0.016 s (two threads, each computing half of the array). Naively interpreted, one loop iteration takes about 122 cycles. This seems a bit slow. But where is the bottleneck?
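
For reference, the 122 cycles come from this back-of-the-envelope calculation (with the core clock taken as roughly 1 GHz):

    iterations per thread = (1024 * 1024 / 4) / 2 = 131072
    cycles per thread     = 0.016 s * 1 GHz       = 16000000
    cycles per iteration  = 16000000 / 131072     ≈ 122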

I even tried the pld instruction to preload elements into the L2 cache, "unrolling" the loop to calculate up to 20 values per iteration, and reordering the instructions to make sure the processor is not waiting for memory. None of that gained much (at most about 0.001 s faster).
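
Roughly, the pld variant replaced the asm block in Multiply with something like the following (a sketch only; the 64-byte prefetch offset is an arbitrary example distance, and the operands and clobbers are unchanged):

    asm volatile(
        ".loop:                             \n\t"
        "pld      [%[src1], #64]            \n\t"   // prefetch ahead of the loads
        "pld      [%[src2], #64]            \n\t"
        "vld1.32  {q1}, [%[src1]:128]!      \n\t"
        "vld1.32  {q2}, [%[src2]:128]!      \n\t"
        "vmul.f32 q0, q1, q2                \n\t"
        "vst1.32  {q0}, [%[dst]:128]!       \n\t"
        "subs     %[lBound], %[lBound], #1  \n\t"
        "bgt      .loop                     \n\t"
        :[dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2),
         [lBound] "+r" (loopBound)
        :
        :"memory", "d0", "d1", "d2", "d3", "d4", "d5", "cc"
    );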

Do you have any suggestions for speeding up the calculation?

Asked Oct 08 '12 by HyraxK


1 Answer

I don't know that much about NEON, but I think you have data dependencies that are causing the performance issue. I would prime the loop with a pair of loads before entering it, and then place the in-loop loads between the multiply and the store; the store is probably stalling until the multiply is done.

    asm volatile(
    "vld1.32  {q1}, [%[src1]:128]!      \n\t"
    "vld1.32  {q2}, [%[src2]:128]!      \n\t"
    ".loop:                             \n\t"
    "vmul.f32 q0, q1, q2                \n\t"
    "vld1.32  {q1}, [%[src1]:128]!      \n\t"
    "vld1.32  {q2}, [%[src2]:128]!      \n\t"
    "vst1.32  {q0}, [%[dst]:128]!       \n\t"
    "subs     %[lBound], %[lBound], $1  \n\t"
    "bge      .loop                     \n\t"
    :
    :[dst] "r" (dst), [src1] "r" (src1), [src2] "r" (src2),
    [lBound] "r" (loopBound)
    :"memory", "d0", "d1", "d2", "d3", "d4", "d5
);

This way the loads should overlap with the multiply. Note that the primed loop reads one vector pair past the end of the sources, so you will need to over-allocate the source arrays, or reduce the loop count and do a final multiply and store after the loop. If the NEON ops do not affect the condition codes, you can also re-order the subs and place it earlier in the loop.

Edit: In fact, the Cortex-A9 Media Processing Engine documentation recommends interleaving ARM and NEON instructions, because they can execute in parallel. Also, the NEON instructions set the FPSCR rather than the ARM CPSR, so re-ordering the subs should reduce the execution time. You may also want to cache-align the start of the loop.
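
Putting the pieces together, here is a sketch of what I mean (a hypothetical MultiplyPipelined, not tested on an A9: the subs is hoisted so the flag-setting ARM instruction overlaps the NEON work, and the last vector pair is handled in an epilogue so the sources do not need to be over-allocated; it assumes at least two vector pairs of data):

void MultiplyPipelined(
    const float* __restrict__ src1,
    const float* __restrict__ src2,
    float* __restrict__ dst,
    const unsigned int width,
    const unsigned int height)
{
    int loopBound = (width * height) / 4 - 1;   // the last vector pair is handled after the loop
    asm volatile(
        "vld1.32  {q1}, [%[src1]:128]!      \n\t"   // prime the first pair of loads
        "vld1.32  {q2}, [%[src2]:128]!      \n\t"
        ".looppl:                           \n\t"   // distinct label so it can coexist with .loop
        "subs     %[lBound], %[lBound], #1  \n\t"   // ARM-side decrement overlaps the NEON work
        "vmul.f32 q0, q1, q2                \n\t"
        "vld1.32  {q1}, [%[src1]:128]!      \n\t"   // loads for the next iteration
        "vld1.32  {q2}, [%[src2]:128]!      \n\t"
        "vst1.32  {q0}, [%[dst]:128]!       \n\t"
        "bgt      .looppl                   \n\t"
        "vmul.f32 q0, q1, q2                \n\t"   // epilogue: multiply and store the last pair
        "vst1.32  {q0}, [%[dst]:128]!       \n\t"
        :[dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2),
         [lBound] "+r" (loopBound)
        :
        :"memory", "d0", "d1", "d2", "d3", "d4", "d5", "cc"
    );
}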

Answered Sep 20 '22 by artless noise