
"cqo", "cdq" and "cwd" x86_64 instructions. Why not use just cqo?

I'm not the most experienced assembly programmer, and I ran into the "cqo", "cdq" and "cwd" instructions, which are all valid x86_64 assembly.

I was wondering whether there is any advantage to using cdq or cwd when operating on smaller values. Is there some difference in performance?

EDIT: I originally started looking into this when calculating the absolute value of one-digit numbers.

For example, if we have the value -9 in ax (cwd reads the sign bit of ax, not al):

cwd
xor ax,dx
sub ax,dx

vs. having it as a 32-bit value and calculating:

cdq
xor eax,edx
sub eax,edx

or, if we have -9 as a 64-bit value:

cqo
xor rax,rdx
sub rax,rdx

If the original value is 64 bits wide and lies between -9 and 9, all three seem to give effectively the same result.

Husky asked Nov 19 '15 19:11




1 Answer

You only have a choice if your value is already sign-extended to fill more than 16 bits of rax.

If you have a signed 16-bit int in ax, but the upper 16 bits of eax are unknown or zero, you must keep using 16-bit instructions. cdq would set edx based on bit 31 of eax, which is garbage here, not on the sign bit of your value in ax.
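To make the failure concrete, here is a minimal C sketch that models what cwd and cdq read. The helper names (model_cwd, model_cdq, demo_garbage_upper_bits) are made up for illustration, and the functions only imitate the sign-bit test each instruction performs; they are not how the CPU implements anything.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of cwd: dx = 0xFFFF if bit 15 of ax is set, else 0.
   Only the low 16 bits of rax are looked at. */
static uint16_t model_cwd(uint64_t rax) {
    return (rax & 0x8000u) ? 0xFFFFu : 0x0000u;
}

/* Hypothetical model of cdq: edx = 0xFFFFFFFF if bit 31 of eax is set, else 0. */
static uint32_t model_cdq(uint64_t rax) {
    return (rax & 0x80000000u) ? 0xFFFFFFFFu : 0x00000000u;
}

static void demo_garbage_upper_bits(void) {
    /* ax = -9 (0xFFF7); the upper 16 bits of eax happen to be zero. */
    uint64_t rax = 0x0000FFF7u;
    assert(model_cwd(rax) == 0xFFFFu);      /* correct: the 16-bit value is negative */
    assert(model_cdq(rax) == 0x00000000u);  /* wrong: bit 31 is not the sign of ax */

    /* Same value, sign-extended to 32 bits first (e.g. by movsx eax, ax). */
    rax = 0xFFFFFFF7u;
    assert(model_cdq(rax) == 0xFFFFFFFFu);  /* now cdq gives the right mask */
}
```

Calling demo_garbage_upper_bits() from a test harness or main should pass all three assertions: cdq is only safe once the value really is sign-extended to 32 bits.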

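Similarly, if you were using 32-bit ops to generate a signed 32-bit int in eax, the upper 32 bits of rax will be zeroed, not sign-extended, so cqo would see a non-negative 64-bit value and set rdx to zero even when eax is negative.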

If you can, use cdq. You only need cqo if you need all 64 bits of rdx set: cdq writes edx, which leaves the upper half of rdx zeroed.
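As a rough illustration of when the 32-bit and 64-bit versions of the absolute-value idiom from the question agree, here is a C sketch that models both sequences. The function names are hypothetical, and the model assumes the usual x86-64 rule that writing a 32-bit register zero-extends into the full 64-bit register.

```c
#include <assert.h>
#include <stdint.h>

/* Model of: cdq / xor eax,edx / sub eax,edx   (32-bit operand size).
   The uint32_t result widened to 64 bits mimics eax zero-extending into rax. */
static uint64_t model_abs_cdq(uint64_t rax) {
    uint32_t eax = (uint32_t)rax;
    uint32_t edx = (eax & 0x80000000u) ? 0xFFFFFFFFu : 0u;
    eax = (eax ^ edx) - edx;            /* wraps modulo 2^32, like the hardware */
    return (uint64_t)eax;
}

/* Model of: cqo / xor rax,rdx / sub rax,rdx   (64-bit operand size). */
static uint64_t model_abs_cqo(uint64_t rax) {
    uint64_t rdx = (rax >> 63) ? UINT64_MAX : 0u;
    return (rax ^ rdx) - rdx;           /* wraps modulo 2^64 */
}

static void demo_abs(void) {
    /* -9 sign-extended to all 64 bits: both sequences agree. */
    uint64_t minus9 = (uint64_t)(int64_t)-9;      /* 0xFFFFFFFFFFFFFFF7 */
    assert(model_abs_cdq(minus9) == 9);
    assert(model_abs_cqo(minus9) == 9);

    /* -9 produced by a 32-bit op: the upper half of rax is zero. */
    uint64_t zext = 0x00000000FFFFFFF7ULL;
    assert(model_abs_cdq(zext) == 9);             /* still correct */
    assert(model_abs_cqo(zext) == 0xFFFFFFF7ULL); /* cqo treats it as positive */
}
```

In this model the cdq version gives the right answer for both inputs, while the cqo version is only right when the value was sign-extended to 64 bits beforehand.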


See http://agner.org/optimize/ to learn about writing asm that runs fast on x86. 32-bit operand size is the default in 64-bit mode, so 16-bit or 64-bit operands require an extra prefix byte: cdq encodes as the single byte 99, while cwd is 66 99 (operand-size prefix) and cqo is 48 99 (REX.W prefix). This means larger code size, which means worse I-cache efficiency (and often more decode bottlenecks on pre-Sandybridge CPUs; SnB's uop cache usually means decode isn't a problem).

16-bit operand size also has a false dependency on the previous contents of the register, since writing ax doesn't clear the rest of rax. Fortunately, AMD64 was designed with out-of-order CPUs in mind, so it avoided repeating that design choice, which is inconvenient for high-performance implementations: writing the low 32 bits of a general-purpose register clears the upper 32 bits. (x86 CPUs already used out-of-order execution when AMD64 was designed, unlike when ax was extended to eax.)
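The difference between the two write rules can be sketched in C. The helpers below (write_ax, write_eax, demo_partial_writes) are hypothetical names that model only the architectural result of the write, not any particular CPU's renaming behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Writing a 16-bit register (e.g. mov ax, imm16) merges into the old value:
   bits 63..16 are preserved, so the result depends on the previous rax. */
static uint64_t write_ax(uint64_t old_rax, uint16_t value) {
    return (old_rax & ~(uint64_t)0xFFFF) | value;
}

/* Writing a 32-bit register (e.g. mov eax, imm32) zero-extends:
   the result does not depend on old_rax at all. */
static uint64_t write_eax(uint64_t old_rax, uint32_t value) {
    (void)old_rax;                      /* deliberately unused: no dependency */
    return (uint64_t)value;
}

static void demo_partial_writes(void) {
    uint64_t old = 0x1122334455667788ULL;
    assert(write_ax(old, 0xFFF7u) == 0x112233445566FFF7ULL);
    assert(write_eax(old, 0xFFFFFFF7u) == 0x00000000FFFFFFF7ULL);
}
```

The old value is a real input to write_ax but not to write_eax, which is the dependency the paragraph above describes; the same reasoning applies to dx and rdx when cwd writes its result.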

Peter Cordes answered Oct 07 '22 12:10