Do any SIMD/vector register instructions exist where the ymm register is specified in a general register (or SIMD register) rather than in the instruction itself?
Essentially what I'm trying to do is write a function that saves any series of consecutive ymm registers onto the local frame. Here is the idea, except I'm inventing what I consider semi-plausible fictional syntax for the instruction I'm looking for.
.text
.align 64
funcname:
testq %rcx, %rcx # is register count <= 0 ?
jle funcdone # # of registers to save is <= 0
xorq %rax, %rax # rax = 0 == vector save element #
funcloop:
vmovapd %ymm(%rsi), (%rdi, %rax) # save ymm?? to address rdi + rax
incq %rsi # advance to the next ymm register number
addq $32, %rax # each ymm vector is 32 bytes
loop funcloop # --rcx; if (rcx != 0) { goto funcloop }
funcdone:
ret
The vmovapd instruction is the strange instruction that does what I'm looking for. I'm sure I've never seen an instruction that looks like that, but that doesn't mean there isn't some unusual instruction that does what I need to do.
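To pin down what that fictional loop is meant to accomplish, here it is modeled in plain C. The ymm registers are simulated as rows of a 2-D byte array; all names are illustrative, not a real API:

```c
#include <string.h>

/* Model: regs[i] stands in for %ymm(i); each row is one 32-byte vector. */
typedef unsigned char ymm_t[32];

/* Save `count` consecutive simulated ymm registers, starting at index
   `first`, into `frame` -- the effect the fictional
   vmovapd %ymm(%rsi), (%rdi, %rax) loop is after. */
static void save_ymm_range(ymm_t *frame, const ymm_t *regs,
                           int first, int count)
{
    for (int i = 0; i < count; i++)
        memcpy(frame[i], regs[first + i], sizeof(ymm_t));
}
```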
Or maybe the instruction would look like one of these:
vmovapd %rsi, (%rdi, %rax)
vmovapd (%rsi), (%rdi, %rax)
Another alternative would be for bits 0 to 15 in %rsi to correspond to vector registers ymm0 to ymm15, where the register corresponding to the lowest set bit is saved (or all "set-bit" ymm registers get saved).
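That bitmask scheme is essentially what ARM's LDM/STM and m68k's MOVEM do for integer registers. The "save all set-bit registers" variant would behave like this C sketch (simulated registers, GCC/Clang `__builtin_ctz`, illustrative names):

```c
#include <string.h>

typedef unsigned char ymm_t[32];

/* Hypothetical mask-driven save: bit i of `mask` selects simulated
   register regs[i]; selected registers are stored consecutively into
   `frame`, lowest-numbered first. Returns the number of registers saved. */
static int save_ymm_mask(ymm_t *frame, const ymm_t *regs, unsigned mask)
{
    int n = 0;
    while (mask) {
        int i = __builtin_ctz(mask);   /* index of lowest set bit */
        memcpy(frame[n++], regs[i], sizeof(ymm_t));
        mask &= mask - 1;              /* clear that bit */
    }
    return n;
}
```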
BTW, for what I need to accomplish, self-modifying code is not an option.
x86's state-save instructions (xsave/xrstor) do take masks in edx:eax to control what state to save/restore. It's really complicated, and the insn ref manual just points you at a whole separate section of another manual. The granularity is per state component, though, not per register: one bit covers the low 128 bits of all sixteen vector regs (the SSE state), and another covers the upper 128 bits of all sixteen ymm regs (the AVX state), so you don't get to choose at the level of individual vector registers.
The specific state components saved correspond to the bits set in the requested-feature bitmap (RFBM), which is the logical-AND of EDX:EAX and XCR0.
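As a concrete sketch of that masking rule: the constants below follow the Intel SDM's state-component numbering, but the function name is made up for illustration.

```c
#include <stdint.h>

/* XSAVE state-component bits (Intel SDM vol. 1, ch. 13):
   bit 0 = x87, bit 1 = SSE (xmm0-15), bit 2 = AVX (upper 128 bits of
   ymm0-15). One bit covers that component across ALL sixteen registers. */
#define XSTATE_X87 (1ull << 0)
#define XSTATE_SSE (1ull << 1)
#define XSTATE_AVX (1ull << 2)

/* RFBM = (EDX:EAX) & XCR0: the mask the instruction supplies is further
   limited by what the OS has enabled in XCR0. */
static uint64_t xsave_rfbm(uint32_t eax, uint32_t edx, uint64_t xcr0)
{
    uint64_t requested = ((uint64_t)edx << 32) | eax;
    return requested & xcr0;
}
```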
For saving/restoring a few ymm regs in a function prologue/epilogue, it's unlikely to be useful. Note that it's xsaveopt that tracks which pieces of state were actually modified and skips the unmodified ones; xsavec instead does "compaction", storing only the state components requested in RFBM, in a compacted layout.
There are no other instructions with an extra level of indirection for registers (a register specifies which register). That would be a big complication for the out-of-order machinery to implement. Even ARM load-multiple instructions (see the other answer) have the register bitmask embedded into the instruction, so it's available as the instruction is being decoded (rather than being filled in later).
You're probably better off with the obvious store/reload of whichever vector registers you want to use but which are call-preserved in the calling convention you're designing.
Note that future extensions to wider vectors would mean only the low 256 bits of your chosen vector regs end up call-preserved, with the bits above that treated as clobbered (whether or not callees that don't touch them actually zero them) instead of saved/restored.
When it comes to "SIMD load / store", there are two different approaches possible: load/store multiple registers from one block of consecutive addresses ("move multiple"), or load/store the elements of one register from/to multiple scattered addresses ("scatter/gather").
ARM and m68k have always done the former: the "move multiple" opcodes on these platforms (ldm/vldm on ARM, movem on m68k) let a single instruction transfer a set of registers to/from consecutive data at the given address; ldm and movem take an arbitrary register mask, while vldm takes a consecutive register range.
Intel's x86 in all its history has never had this, except for PUSHA / POPA, which in 16/32-bit mode unconditionally (non-maskably) save/restore the general-purpose registers to/from the stack; that instruction pair was removed in 64-bit mode.
Intel, with AVX2, instead created the ability to load from multiple addresses simultaneously, i.e. do a "gather" (the store counterpart, "scatter", arrived later with AVX-512).
So yes, the x86 replacement for something on ARM like:
VLDM.64 r0, {d4-d7} ; load vector regs d4..d7 from mem @ r0 (vldm requires a consecutive range)
would be a sequence:
VMOVAPD YMM4, [ RAX ]
VMOVAPD YMM5, [ RAX + 32 ]
VMOVAPD YMM6, [ RAX + 64 ]
VMOVAPD YMM7, [ RAX + 96 ]
On the other hand, the ARM equivalent (see ARM docs, "indirect addressing") of an x86 VGATHER instruction like:
VPGATHERDD YMM0, [ RAX + YMM1*4 ], YMM2 ; YMM1[0..7] holds 32bit indices off RAX, YMM2 is the per-lane mask
requires multiple loads to single elements of a vector register (sub-register "lane" loads), or full loads with a "combine" at the end; with lane loads it'd turn into a sequence like:
VLD1.32 {d0[0]}, [r0] ; r0..r3 hold the four element addresses
VLD1.32 {d0[1]}, [r1]
VLD1.32 {d1[0]}, [r2]
VLD1.32 {d1[1]}, [r3] ; q0 (d0:d1) now holds the four gathered 32bit elements
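The per-lane behavior of the x86 gather itself can be modeled in scalar C. The names are illustrative; the merge behavior for masked-off lanes matches AVX2's gather semantics:

```c
#include <stdint.h>

/* Scalar model of an 8-element dword gather (VPGATHERDD-style):
   dst[i] = base[idx[i]] for each lane whose mask bit is set;
   masked-off lanes keep their old value (merge masking). The real
   instruction also clears each mask element as it completes. */
static void gather_d(uint32_t dst[8], const uint32_t *base,
                     const uint32_t idx[8], uint8_t mask)
{
    for (int i = 0; i < 8; i++)
        if (mask & (1u << i))
            dst[i] = base[idx[i]];
}
```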