Might look similar to: ARM and NEON can work in parallel?, but its not, I have some other issue ( may be problem with my understanding):
In the protocol stack, while we compute checksum, that is done on the GPP, I’m handing over that task now to NEON as part of a function:
Here is the checksum function that I have written as a part of NEON, posted in Stack Overflow: Checksum code implementation for Neon in Intrinsics
Now, suppose from linux this function is called,
ip_csum(){
…
…
csum = do_csum(); //function call from arm
…
…
}
do_csum(){
…
…
//NEON optimised code
…
…
returns the final checksum to ip_csum/linux/ARM
}
in this case.. what happens to ARM when NEON is doing the calculations? does ARM sit idle? or it moves on with other operations?
as you can see do_csum is called and we are waiting on that result ( or that is what it looks like)..
NOTE:
Questions:
The NDK supports ARM Advanced SIMD, commonly known as Neon, an optional instruction set extension for ARMv7 and ARMv8. Neon provides scalar/vector instructions and registers (shared with the FPU) comparable to MMX/SSE/3DNow! in the x86 world.
Arm Neon technology is an advanced Single Instruction Multiple Data (SIMD) architecture extension for the A-profile and R-profile processors. Neon technology is a packed SIMD architecture.
Arm® NEON™ technology is an advanced single instruction multiple data (SIMD) architecture extension for the Arm® Cortex®-A Series. NEON technology is intended to improve the multimedia user experience by accelerating audio and video encoding/decoding, user interface, 2D/3D graphics or gaming.
Arm Neon is an advanced single instruction multiple data (SIMD) architecture extension for the Arm Cortex-A and Arm Cortex-R series of processors with capabilities that vastly improve use cases on mobile devices, such as multimedia encoding/decoding, user interface, 2D/3D graphics and gaming.
In Application Level Programmers’ Model
, you can't really distinguish between ARM and NEON units.
While NEON being a separate hardware unit (that is available as an option on Cortex-A series processors), it is the ARM core who drives it in a tight fashion. It is not a separate DSP which you can communicate in an asynchronous fashion.
You can write better code by fully utilizing pipelines on both units, but this is not same as having a separate core.
NEON unit is there because it can do some operations (SIMDs) much faster than ARM unit at a low frequency.
This is like having a friend who is good at math, whenever you have a hard question you can ask him. While waiting for an answer you can do some small things like if answer is this I should do this or if not instead do that but if you depend on that answer to go on, you need to wait for him to answer before going further. You could calculate the answer yourself but it will be much faster even including the communication time between two of you compared to doing all the math yourself. I think you can even extend this analogy like "you also need to buy some lunch to that friend (energy consumption) but in many cases it worths it".
Anyone who is saying ARM core can do other things while NEON core is working on its stuff is talking about instruction-level parallelism not anything like task-level parallelism.
(Image from TI Wiki Cortex A8)
The ARM (or rather the Integer Pipeline) does not sit idle while NEON instructions are processing. In the Cortex A8, the NEON is at the "end" of the processor pipeline, instructions flow through the pipeline and if they are ARM instructions they are executed in the "beginning" of the pipeline and NEON instructions are executed in the end. Every clock pushes the instruction down the pipeline.
Here are some hints on how to read the diagram above:
If you are executing a sequence that is 100% NEON instructions (which is pretty rare, since there are usually some ARM registers involved, control flow etc.) then there is some period where the the integer pipeline isn't doing anything useful. Most code will have the two executing concurrently for at least some of the time while cleverly engineered code can maximize performance with the right instructions mix.
This online tool Cycle Counter for Cortex A8 is great for analyzing the performance of your assembly code and gives information about what is executing in what units and what is stalling.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With