I'm not the most experienced assembly programmer, and I ran into the "cqo", "cdq" and "cwd" instructions, which are all valid x86_64 assembly.
I was wondering if there are any advantages to using cdq or cwd when operating on smaller values. Is there some difference in performance?
EDIT: I originally started looking into this when calculating the absolute value of one-digit numbers.
For example, if we have the value -9 in al, already sign-extended into ax (cwd takes the sign from bit 15 of ax, not from al):
cwd
xor al,dl
sub al,dl
vs. having it as a 32-bit value and calculating:
cdq
xor eax,edx
sub eax,edx
or, if we have -9 as a 64-bit value:
cqo
xor rax,rdx
sub rax,rdx
If the original value is 64 bits wide and holds a value from -9 to 9, all three sequences seem to give the same result.
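The three sequences above share one idea: build a mask from the sign bit, then xor and subtract it. A minimal C sketch of the 32-bit variant follows; the function names are hypothetical and the code only models what the registers would hold, it is not the assembly itself.

```c
#include <stdint.h>

/* Models cdq: the value edx would receive, i.e. all ones when the
   sign bit (bit 31) of eax is set, and zero otherwise. */
uint32_t sign_mask32(uint32_t eax) {
    return (eax & 0x80000000u) ? 0xFFFFFFFFu : 0u;
}

/* Models the sequence: cdq / xor eax,edx / sub eax,edx. */
uint32_t abs_via_mask32(uint32_t eax) {
    uint32_t edx = sign_mask32(eax);
    eax ^= edx;  /* negative input: flips every bit; otherwise a no-op */
    eax -= edx;  /* negative input: subtracting all ones adds 1, finishing the negation */
    return eax;
}
```

For a non-negative input the mask is zero, so both steps leave the value unchanged. The most negative value (0x80000000) maps to itself, as it does with the assembly version.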
The CQO instruction can be used to produce a double quadword dividend from a quadword before a quadword division. The CWD and CDQ mnemonics reference the same opcode. The CWD instruction is intended for use when the operand-size attribute is 16 and the CDQ instruction for when the operand-size attribute is 32.
A “longword” is 32 bits, a “quadword” is 64. Do a cqto instruction (with no operands) to sign-extend it into rdx. cqto (the AT&T-syntax name for cqo) stands for “convert quadword to octword.”
You only have a choice if your value is already sign-extended to fill more than 16 bits of rax.
If you have a signed 16bit int in ax, but the upper16 of eax is unknown or zero, you must keep using 16bit instructions. cdq would set edx based on the garbage bit at the top of eax, not the sign bit of your value in ax.
Similarly, if you were using 32bit ops to generate a signed 32bit int in eax, the upper32 will be zeroed, not sign-extended.
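To make the point about unextended upper bits concrete, here is a small C model. The helper `sign_fill` is a hypothetical name: given the full contents of rax and an operand width of 16, 32 or 64, it returns what cwd, cdq or cqo would place in dx, edx or rdx.

```c
#include <stdint.h>

/* Hypothetical model of cwd (width 16), cdq (width 32) and cqo (width 64):
   replicate bit (width - 1) of rax across a width-bit result. */
uint64_t sign_fill(uint64_t rax, unsigned width) {
    uint64_t sign_bit = (rax >> (width - 1)) & 1u;
    uint64_t all_ones = (width == 64) ? UINT64_MAX
                                      : ((UINT64_C(1) << width) - 1);
    return sign_bit ? all_ones : 0;
}
```

With -9 held only in ax (rax = 0x000000000000FFF7), `sign_fill(rax, 16)` gives 0xFFFF, but `sign_fill(rax, 32)` gives 0 because bit 31 is clear. Likewise, with -9 in a zero-extended eax (rax = 0x00000000FFFFFFF7), the 32-bit result is 0xFFFFFFFF while the 64-bit result is 0.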
If you can, use cdq. You might need cqo if you need all 64bits set in rdx.
See http://agner.org/optimize/ to learn about making asm that runs fast on x86. 32bit operand size is the default in 64bit mode, so 16 or 64bit operands require an extra prefix. This means larger code size, which means worse I-cache efficiency (and often more decode bottlenecks on pre-Sandybridge CPUs; SnB's uop cache usually means decode isn't a problem.)
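The prefix cost is visible in the machine code. The sketch below lists the 64-bit-mode encodings of the three instructions as C byte arrays; the array names are hypothetical, chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodings in 64-bit mode. All three share opcode 0x99. */
const uint8_t ENC_CWD[] = { 0x66, 0x99 };  /* operand-size prefix + opcode */
const uint8_t ENC_CDQ[] = { 0x99 };        /* default 32-bit operand size: no prefix */
const uint8_t ENC_CQO[] = { 0x48, 0x99 };  /* REX.W prefix + opcode */

/* Size in bytes of each encoding. */
const size_t LEN_CWD = sizeof ENC_CWD;
const size_t LEN_CDQ = sizeof ENC_CDQ;
const size_t LEN_CQO = sizeof ENC_CQO;
```

cdq is one byte, while cwd and cqo are two bytes each. The same one-byte prefix overhead applies to the 16-bit and 64-bit forms of the xor and sub that follow.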
16bit also has a false dependency on the previous contents of the register, since writing ax doesn't clear the rest of rax. Fortunately, AMD64 was designed with out-of-order CPUs in mind, so it avoided repeating that design choice that's inconvenient for high-performance, by clearing the upper32 when writing the low 32bits of a GP reg. (x86 CPUs already used OOO when AMD64 was designed, unlike when ax was extended to eax).
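The difference between the two write behaviours can be modelled in C. The function names here are hypothetical; they only describe what rax holds after the write, not how a CPU implements it.

```c
#include <stdint.h>

/* Writing ax: the upper 48 bits of rax are preserved, so the result
   depends on the old register contents (the source of the false dependency). */
uint64_t write_ax(uint64_t old_rax, uint16_t value) {
    return (old_rax & ~UINT64_C(0xFFFF)) | value;
}

/* Writing eax: the upper 32 bits of rax are zeroed, so the result
   does not depend on the old register contents at all. */
uint64_t write_eax(uint64_t old_rax, uint32_t value) {
    (void)old_rax;  /* deliberately unused */
    return (uint64_t)value;
}
```

Because `write_eax` ignores its first argument, an out-of-order CPU can start the 32-bit write without waiting for whatever last produced rax, whereas the 16-bit write has to merge with it.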