Is there an advantage of specifying "-mfpu=neon-vfpv3" over "-mfpu=neon" for ARMs with separate pipelines?

My Zynq-7000 ARM Cortex-A9 processor has both the NEON and the VFPv3 extension, and the Zynq-7000 TRM says that the processor is configured to have "Independent pipelines for VFPv3 and advanced SIMD instructions".

So far I have compiled my programs with Linaro GCC 6.3-2017.05 and the -mfpu=neon option to make use of SIMD instructions. But in the case where the compiler also has non-SIMD floating-point operations to issue, will it make a difference to use -mfpu=neon-vfpv3? Will GCC's instruction selection and scheduler emit instructions for both extensions, so that it can make use of both pipelines and increase CPU utilization?
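
For reference, one way to see what each flag actually changes is to compile a small test case to assembly under both settings and diff the output. A minimal sketch (file and function names are placeholders; the cross-compiler name is just an example):

    /* saxpy.c - a small kernel with vectorizable float work (placeholder name).
     * Compile to assembly under each -mfpu setting and diff the results, e.g.:
     *   arm-linux-gnueabihf-gcc -O3 -ffast-math -mfloat-abi=hard -mfpu=neon       -S saxpy.c -o neon.s
     *   arm-linux-gnueabihf-gcc -O3 -ffast-math -mfloat-abi=hard -mfpu=neon-vfpv3 -S saxpy.c -o neon-vfpv3.s
     *   diff neon.s neon-vfpv3.s
     * (-ffast-math is included because GCC normally uses NEON for float
     * arithmetic only under unsafe-math rules, since NEON flushes denormals.)
     */
    void saxpy(float *restrict y, const float *restrict x, float a, int n)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }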

asked Dec 12 '17 by Johannes Schaub - litb


2 Answers

Technically, yes.

In reality, no.

NEON has always been optional on ARMv7.

Licensees can choose one of the following configurations:

  • none
  • VFP only
  • NEON plus VFP

Unlike NEON, there have been different VFP versions on ARMv7, the VFP-lite on the Cortex-A8 being the most notorious one: it is not pipelined and is therefore extremely slow.

Therefore, it technically makes sense to specify the CPU configuration and the architecture version via compiler options, so that the compiler can generate the most optimized machine code for that particular architecture/configuration.
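
For a Zynq-7000's Cortex-A9, the fully specified options would look roughly like the build line sketched below; the predefined ACLE macros are a quick way to confirm what the compiler was actually configured for (an illustrative sketch, not a Zynq-specific recipe):

    /* Build with the CPU and FPU spelled out explicitly, e.g.:
     *   arm-linux-gnueabihf-gcc -O2 -mcpu=cortex-a9 -mfpu=neon-vfpv3 -mfloat-abi=hard check.c
     * The predefined macros show what the compiler thinks it is targeting. */
    #include <stdio.h>

    int main(void)
    {
    #ifdef __ARM_NEON__
        puts("NEON (Advanced SIMD) code generation is enabled");
    #endif
    #ifdef __ARM_FP
        printf("VFP support mask (__ARM_FP): %d\n", __ARM_FP);
    #endif
        return 0;
    }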

In reality, however, compilers these days ignore most of these build options, and even directives on top of that.

And the fact that VFP and NEON instructions are assigned to different pipelines won't help much, if at all, since they both share the same register bank.

Boosting NEON's performance by utilizing as many registers as possible will gain you far more than letting the VFP run in parallel instead.
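
To illustrate what "utilizing as many registers as possible" means in practice, here is a minimal NEON-intrinsics sketch (the function name and unroll factor are arbitrary choices): four independent q-register accumulators keep the NEON pipeline busy instead of serializing on a single register.

    #include <arm_neon.h>

    /* Sum a float array using four independent q-register accumulators,
     * so consecutive vector adds don't stall waiting for each other.
     * Assumes n is a multiple of 16 to keep the sketch short. */
    float sum_f32(const float *x, int n)
    {
        float32x4_t acc0 = vdupq_n_f32(0.0f);
        float32x4_t acc1 = vdupq_n_f32(0.0f);
        float32x4_t acc2 = vdupq_n_f32(0.0f);
        float32x4_t acc3 = vdupq_n_f32(0.0f);

        for (int i = 0; i < n; i += 16) {
            acc0 = vaddq_f32(acc0, vld1q_f32(x + i));
            acc1 = vaddq_f32(acc1, vld1q_f32(x + i + 4));
            acc2 = vaddq_f32(acc2, vld1q_f32(x + i + 8));
            acc3 = vaddq_f32(acc3, vld1q_f32(x + i + 12));
        }

        /* Combine the four partial sums and reduce to a scalar. */
        float32x4_t acc = vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3));
        float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
        return vget_lane_f32(vpadd_f32(half, half), 0);
    }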

It puzzles me why and how so many people put so much trust in free compilers these days.

The best ARM compiler available is hands down ARM's own, which comes with the $6k+ DS-5 Ultimate Edition. Their support is excellent, but I'm not sure whether it justifies the price tag.

answered Nov 13 '22 by Jake 'Alquimista' LEE


ARM's Cortex-A9 NEON/VFP manual (Cortex™-A9 NEON™ Media Processing Engine) says, in section 3.2 Writing optimal VFP and Advanced SIMD code:

The following guidelines can provide significant performance increases for VFP and Advanced SIMD code: Where possible avoid:

  • ...

  • mixing Advanced SIMD only instructions with VFP only instructions.

It says it can execute NEON and VFP instructions in parallel with ARM or Thumb instructions (i.e. scalar integer code), "with the exception of simultaneous loads and stores".

It's not 100% clear if they mean avoid having them in flight at once at all, or if they mean avoid having data dependencies between VFP and NEON instructions. It's easy to imagine the latter being bad for reasons that don't apply to the former (e.g. maybe no bypass forwarding between execution units in different domains).


The cycle timings in the same document indicate that VFP scalar instructions take longer in the pipeline than NEON instructions (even if the latency appears to be the same), so probably using VFP is a win for code that doesn't vectorize, even with -ffast-math. Or, if I'm reading this right, NEON has lower-latency MUL, so it may be a win for long dependency chains.
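
One way to see which unit a given GCC actually picks for scalar, non-vectorizable code is to compile a latency-bound chain under each setting and look at the registers used (sN operands are the VFP encodings, dN/qN the NEON ones). A hedged sketch, with placeholder names:

    /* chain.c - a dependent single-precision multiply-add chain:
     * latency-bound scalar code with nothing to vectorize.
     *   arm-linux-gnueabihf-gcc -O3 -mfloat-abi=hard -mfpu=neon       -S chain.c
     *   arm-linux-gnueabihf-gcc -O3 -mfloat-abi=hard -mfpu=neon-vfpv3 -S chain.c
     *   (optionally add -ffast-math to each and compare again)
     */
    float horner(float x, const float *c, int n)
    {
        float r = c[n - 1];
        for (int i = n - 2; i >= 0; i--)
            r = r * x + c[i];   /* each step depends on the previous result */
        return r;
    }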

The Cortex-A9, if it has VFP at all, has fully-pipelined VFP execution units. For example:

  • VADD/VSUB .F (Sn) or .D (Dn) (VFP): 1c throughput. Inputs needed on cycle 1, results ready on cycle 4. (So 4c latency?)

  • VADD/VSUB Dn (NEON): 1c throughput. Inputs needed on cycle 2, results ready on cycle 5 (write-back on cycle 6). (So 4c or 5c latency, depending on what consumes the result?)

  • VADD/VSUB Qn (NEON): 1 per 2c throughput. Inputs needed on cycles 2 and 3, results ready on cycles 5 and 6 (write-back 1c later than that). (So 4c or 5c latency?)

  • VMUL .F Sd,Sn,Sm (VFP): 1c throughput. Inputs needed on cycle 1, results ready on cycle 5. (So 5c latency?)

  • VMUL (VFP) with double-precision isn't listed, only VNMUL (2c throughput).

  • VMUL (NEON): same timings as VADD/VSUB. Maybe not handling denormals allows a shortcut? If I'm reading this right, it's actually lower latency than VFP, except for the instruction needing to issue earlier.

There's also special result-forwarding for multiply-accumulate. See the PDF.
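
For completeness, here is a minimal intrinsics sketch of the kind of back-to-back multiply-accumulate chain that this forwarding path matters for (the function name and data layout are just illustrative):

    #include <arm_neon.h>

    /* Dot product built from dependent VMLA (multiply-accumulate) steps.
     * Each vmlaq_f32 consumes the previous accumulator, so the MAC-to-MAC
     * forwarding described in the manual sets the loop's critical path.
     * Assumes n is a multiple of 4 to keep the sketch short. */
    float dot_f32(const float *a, const float *b, int n)
    {
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (int i = 0; i < n; i += 4)
            acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));

        /* Horizontal reduction of the four lanes to a scalar result. */
        float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
        return vget_lane_f32(vpadd_f32(half, half), 0);
    }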

answered Nov 13 '22 by Peter Cordes