Learning about ARM NEON intrinsics, I was timing a function that I wrote to double the elements in an array.The version that used the intrinsics takes more time than a plain C version of the function.
Without NEON :
void double_elements(unsigned int *ptr, unsigned int size)
{
unsigned int loop;
for( loop= 0; loop<size; loop++)
ptr[loop]<<=1;
return;
}
With NEON:
void double_elements(unsigned int *ptr, unsigned int size)
{
unsigned int i;
uint32x4_t Q0,vector128Output;
for( i=0;i<(SIZE/4);i++)
{
Q0=vld1q_u32(ptr);
Q0=vaddq_u32(Q0,Q0);
vst1q_u32(ptr,Q0);
ptr+=4;
}
return;
}
Wondering if the load/store operations between the array and vector is consuming more time which offsets the benefit of the parallel addition.
UPDATE:More Info in response to Igor's reply.
1.The code is posted here:
plain.c
plain.s
neon.c
neon.s
From the section(label) L7 in both the assembly listings,I see that the neon version has more number of assembly instructions.(hence more time taken?)
2.I compiled using -mfpu=neon on arm-gcc, no other flags or optimizations.For the plain version, no compiler flags at all.
3.That was a typo, SIZE was meant to be size;both are same.
4,5.Tried on an array of 4000 elements. I timed using gettimeofday() before and after the function call.NEON=230us,ordinary=155us.
6.Yes I printed the elements in each case.
7.Did this, no improvement whatsoever.
Something like this might run a bit faster.
void double_elements(unsigned int *ptr, unsigned int size)
{
unsigned int i;
uint32x4_t Q0,Q1,Q2,Q3;
for( i=0;i<(SIZE/16);i++)
{
Q0=vld1q_u32(ptr);
Q1=vld1q_u32(ptr+4);
Q0=vaddq_u32(Q0,Q0);
Q2=vld1q_u32(ptr+8);
Q1=vaddq_u32(Q1,Q1);
Q3=vld1q_u32(ptr+12);
Q2=vaddq_u32(Q2,Q2);
vst1q_u32(ptr,Q0);
Q3=vaddq_u32(Q3,Q3);
vst1q_u32(ptr+4,Q1);
vst1q_u32(ptr+8,Q2);
vst1q_u32(ptr+12,Q3);
ptr+=16;
}
return;
}
There are a few problems with the original code (some of those the optimizer may fix but other it may not, you need to verify in the generated code):
If you were to write this with inline assembly you could also:
95us for 4000 elements sounds pretty slow, on a 1GHz processor that's ~95 cycles per 128bit chunk. You should be able to do better assuming you're working from the cache. This figure is about what you'd expect if you're bound by the speed of the external memory.
The question is rather vague and you didn't provide much info but I'll try to give you some pointers.
size
, second uses SIZE
, is this intentional? Are they the same?vshl_n_u32
) instead of the addition.Edit: thanks for the answers. I've looked a bit around and found this discussion, which says (emphasis mine):
Moving data from NEON to ARM registers is Cortex-A8 is expensive, so NEON in Cortex-A8 is best used for large blocks of work with little ARM pipeline interaction.
In your case there's no NEON to ARM conversion but only loads and stores. Still, it seems that the savings in parallel operation are eaten up by the non-NEON parts. I would expect better results in code which does many things while in NEON, e.g. color conversions.
Process in bigger quantities per instruction, and interleave load/stores, and interleave usage. This function currently doubles (shifts left) 56 uint.
void shiftleft56(const unsigned int* input, unsigned int* output)
{
__asm__ (
"vldm %0!, {q2-q8}\n\t"
"vldm %0!, {q9-q15}\n\t"
"vshl.u32 q0, q2, #1\n\t"
"vshl.u32 q1, q3, #1\n\t"
"vshl.u32 q2, q4, #1\n\t"
"vshl.u32 q3, q5, #1\n\t"
"vshl.u32 q4, q6, #1\n\t"
"vshl.u32 q5, q7, #1\n\t"
"vshl.u32 q6, q8, #1\n\t"
"vshl.u32 q7, q9, #1\n\t"
"vstm %1!, {q0-q6}\n\t"
// "vldm %0!, {q0-q6}\n\t" if you want to overlap...
"vshl.u32 q8, q10, #1\n\t"
"vshl.u32 q9, q11, #1\n\t"
"vshl.u32 q10, q12, #1\n\t"
"vshl.u32 q11, q13, #1\n\t"
"vshl.u32 q12, q14, #1\n\t"
"vshl.u32 q13, q15, #1\n\t"
// lost cycle here unless you overlap
"vstm %1!, {q7-q13}\n\t"
: "=r"(input), "=r"(output) : "0"(input), "1"(output)
: "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7",
"q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15", "memory" );
}
What's important to remember for Neon optimization... It has two pipelines, one for load/stores (with a 2 instruction queue - one pending and one running - typically taking 3-9 cycles each), and one for arithmetical operations (with a 2 instruction pipeline, one executing and one saving its results). As long as you keep these two pipelines busy and interleave your instructions, it will work really fast. Even better, if you have ARM instructions, as long as you stay in registers, it will never have to wait for NEON to be done, they will be executed at the same time (up to 8 instructions in cache)! So you can put up some basic loop logic in ARM instructions, and they'll be executed simultaneously.
Your original code also was only using one register value out of 4 (q register have 4 32 bits values). 3 of them were getting a doubling operation for no apparent reason, so you were 4 times as slow as you could've been.
What would be better in this code is to for this loop, process them embedded by adding vldm %0!, {q2-q8}
following the vstm %1!
... and so on. You also see I wait 1 more instruction before sending out its results, so the pipes are never waiting for something else. Finally, note the !
, it means post-increment. So it reads/writes the value, and then increments the pointer from the register automatically. I suggest you don't use that register in ARM code, so it won't hang its own pipelines... keep your registers separated, have a redundant count
variable on ARM side.
Last part ... what I said might be true, but not always. It depends on the current Neon revision you have. Timing might change in the future, or might not have always been like that. It works for me, ymmv.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With