I am working on ARM optimizations using the NEON intrinsics, from C++ code. I understand and master most of the typing issues, but I am stuck on this one:
The instruction vzip_u8 returns a uint8x8x2_t value (in fact an array of two uint8x8_t). I want to assign the returned value to a plain uint16x8_t. I see no appropriate vreinterpretq intrinsic to achieve that, and simple casts are rejected.
Some definitions to answer clearly...
On ARMv7, NEON has 32 registers, each 64 bits wide (with a dual view as 16 registers, each 128 bits wide).
The NEON unit can view the same register bank as:
- sixteen 128-bit quadword registers, Q0-Q15
- thirty-two 64-bit doubleword registers, D0-D31.
uint16x8_t is a type which requires 128-bit storage, thus it needs to be in a quadword register.
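To make the D/Q relationship concrete, here is a minimal sketch that splits a 128-bit vector into its two 64-bit halves with vget_low_u16 and vget_high_u16. The names Halves and split_halves are hypothetical, made up for this illustration. Which physical registers end up being used is the compiler's decision, and the scalar branch is only there so the snippet also builds on a non-ARM machine.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical container for the two 64-bit halves of a 128-bit vector.
struct Halves {
    std::array<std::uint16_t, 4> low;   // lanes 0..3 (the lower doubleword)
    std::array<std::uint16_t, 4> high;  // lanes 4..7 (the upper doubleword)
};

// Hypothetical helper: load eight 16-bit values and return both halves.
inline Halves split_halves(const std::uint16_t* in) {
    assert(in != nullptr);
    Halves h{};
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint16x8_t q = vld1q_u16(in);                // one quadword-sized value
    vst1_u16(h.low.data(), vget_low_u16(q));     // its lower doubleword
    vst1_u16(h.high.data(), vget_high_u16(q));   // its upper doubleword
#else
    // Scalar stand-in with the same lane layout, for non-ARM builds.
    for (int i = 0; i < 4; ++i) {
        h.low[i] = in[i];
        h.high[i] = in[i + 4];
    }
#endif
    return h;
}
```

Calling split_halves on the values 1 to 8 gives 1..4 in low and 5..8 in high.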
The ARM NEON intrinsics define a vector array data type in the ARM® C Language Extensions:
... for use in load and store operations, in table-lookup operations, and as the result type of operations that return a pair of vectors.
vzip instruction
... interleaves the elements of two vectors.
vzip Dd, Dm
and has an intrinsic like
uint8x8x2_t vzip_u8 (uint8x8_t, uint8x8_t)
From these we can conclude that uint8x8x2_t is actually a pair of two arbitrarily numbered doubleword registers: the vzip instruction writes its results back into Dd and Dm, and it places no requirement on which registers those are.
Now the answer is...
uint8x8x2_t can contain two non-consecutive doubleword registers, while uint16x8_t is a data structure consisting of two consecutive doubleword registers, the first of which has an even index (D0-D31 -> Q0-Q15).
Because of this you can't cast a vector array data type holding two doubleword registers to a quadword register... easily.
The compiler may be smart enough to assist you, or you can just force the conversion; however, I would check the resulting assembly for correctness as well as performance.
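One way to force the conversion, as a minimal sketch rather than the definitive answer: join the two halves of the vzip_u8 result with vcombine_u8, then reinterpret the resulting uint8x16_t with vreinterpretq_u16_u8. The helper names zip_as_u16x8 and zip_bytes_to_u16 are hypothetical. The scalar branch assumes a little-endian lane layout and exists only so the snippet builds off-ARM. The compiler may need extra moves to bring the two doubleword registers together, so the generated assembly is still worth a look.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

// Hypothetical helper: zip two 8-byte vectors and view the 16 result bytes
// as eight 16-bit lanes.
inline uint16x8_t zip_as_u16x8(uint8x8_t a, uint8x8_t b) {
    uint8x8x2_t z = vzip_u8(a, b);                   // val[0] = a0 b0 .. a3 b3, val[1] = a4 b4 .. a7 b7
    uint8x16_t q = vcombine_u8(z.val[0], z.val[1]);  // two doubleword halves -> one quadword value
    return vreinterpretq_u16_u8(q);                  // same bits, now typed as uint16x8_t
}
#endif

// Hypothetical portable wrapper: reads 8 bytes from each input and returns
// the eight 16-bit lanes, so the result can be inspected on any machine.
inline std::array<std::uint16_t, 8> zip_bytes_to_u16(const std::uint8_t* a,
                                                     const std::uint8_t* b) {
    assert(a != nullptr && b != nullptr);
    std::array<std::uint16_t, 8> out{};
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    vst1q_u16(out.data(), zip_as_u16x8(vld1_u8(a), vld1_u8(b)));
#else
    // Scalar stand-in (assumption: little-endian lanes, so the byte from a
    // is the low byte and the byte from b is the high byte).
    for (int i = 0; i < 8; ++i) {
        out[i] = static_cast<std::uint16_t>(a[i] | (b[i] << 8));
    }
#endif
    return out;
}
```

On a little-endian target each 16-bit lane then holds a[i] in its low byte and b[i] in its high byte.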