I am trying to call a function (one that should have an absolute address once compiled and linked) from machine code. I am creating a function pointer to the desired function and trying to pass that to the call instruction, but I noticed that the call instruction takes at most a 16- or 32-bit address. Is there a way to call an absolute 64-bit address?
I am deploying for the x86-64 architecture and using NASM to generate the machine code.
I could work with a 32-bit address if I could be guaranteed that the executable would be for sure mapped to the bottom 4GB of memory, but I am not sure where I could find that information.
Edit: I cannot use the callf instruction, as that requires me to disable 64-bit mode.
Second Edit: I also do not want to store the address in a register and call the register, as this is performance critical, and I cannot have the overhead and performance hit of an indirect function call.
Final Edit: I was able to use the rel32 call instruction by ensuring that my machine code was mapped into the first 2GB of memory. This was achieved through mmap with the MAP_32BIT flag (I'm using Linux):
MAP_32BIT (since Linux 2.4.20, 2.6) Put the mapping into the first 2 Gigabytes of the process address space. This flag is supported only on x86-64, for 64-bit programs. It was added to allow thread stacks to be allocated somewhere in the first 2GB of memory, so as to improve context-switch performance on some early 64-bit processors. Modern x86-64 processors no longer have this performance problem, so use of this flag is not required on those systems. The MAP_32BIT flag is ignored when MAP_FIXED is set.
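A minimal sketch of that allocation, assuming Linux on x86-64 with glibc. The helper names alloc_low_code and make_executable are made up for illustration. It maps the buffer read/write first and flips it to read/execute after the code is written, rather than asking for a writable+executable mapping. Note that a low buffer only lets call rel32 reach targets that are themselves within 2GiB of it, e.g. functions in a non-PIE executable.

```c
#define _GNU_SOURCE            /* for MAP_32BIT / MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical helper: reserve `size` bytes of read/write memory in the
   low 2GiB of the address space (Linux, x86-64 only). Returns NULL on failure. */
void *alloc_low_code(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    /* MAP_32BIT should guarantee this; check anyway before relying on rel32. */
    if ((uint64_t)(uintptr_t)p + size > UINT64_C(0x80000000)) {
        munmap(p, size);
        return NULL;
    }
    return p;
}

/* Hypothetical helper: once the machine code has been written, make the
   buffer read+execute (no longer writable). Returns 0 on success, -1 on error. */
int make_executable(void *p, size_t size)
{
    return mprotect(p, size, PROT_READ | PROT_EXEC);
}
```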
Related: Handling calls to (potentially) far away ahead-of-time compiled functions from JITed code has more about JITing, especially allocating your JIT buffer near the code it wants to call so you can use an efficient call rel32, or what to do if you can't.
Also, Call an absolute pointer in x86 machine code is a good canonical Q&A about call or jmp to an absolute address.
TL;DR: To call a function by name, just use call func like a normal person and let the assembler and linker take care of it. Since you say you're using NASM, I guess you're actually generating the machine code with an assembler. It sounded like a more complicated question, but I think you were just trying to ask whether the normal way was safe.
Indirect call r/m64 (FF /2) takes a 64-bit register or memory operand in 64-bit mode.
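To make the FF /2 encoding concrete: with a register operand, the ModRM byte is mod=11, reg=2 (the /2 that selects CALL), rm=low 3 bits of the register number. r8-r15 need a REX.B prefix, and no REX.W is needed because call defaults to 64-bit operand size in 64-bit mode. A small sketch with a made-up helper name:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode "call r64" (FF /2) for register number reg
   (0 = rax, 1 = rcx, ... 15 = r15). Writes 2 or 3 bytes, returns the length. */
size_t encode_call_reg(uint8_t *out, unsigned reg)
{
    size_t n = 0;
    if (reg >= 8)
        out[n++] = 0x41;                 /* REX.B: rm field selects r8-r15 */
    out[n++] = 0xFF;                     /* opcode group 5 */
    out[n++] = (uint8_t)(0xC0            /* mod = 11: register operand */
                         | (2u << 3)     /* reg field = 2: CALL r/m64 */
                         | (reg & 7u));  /* rm = low 3 bits of register number */
    return n;
}
```

So call rax is FF D0 and call r11 is 41 FF D3.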
So you can do
func equ 0x123456789ab
; or if func is a regular label
mov rax, func ; mov r64, imm64, or mov r32, imm32 if it fits
call rax
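If you're writing the bytes yourself instead of running NASM, that sequence is 48 B8 imm64 followed by FF D0. A sketch with a hypothetical emit_call_abs64 helper; unlike NASM, it always uses the 10-byte imm64 form of mov, even when the address would fit in 32 bits:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: emit "mov rax, imm64 ; call rax" (12 bytes). Returns the length. */
size_t emit_call_abs64(uint8_t *out, uint64_t target)
{
    out[0] = 0x48;                   /* REX.W: 64-bit operand size */
    out[1] = 0xB8;                   /* mov r64, imm64 (B8+rd, rd = 0 for rax) */
    for (int i = 0; i < 8; i++)      /* imm64, little-endian */
        out[2 + i] = (uint8_t)(target >> (8 * i));
    out[10] = 0xFF;                  /* call r/m64 ... */
    out[11] = 0xD0;                  /* ... ModRM: mod=11, /2, rm=rax */
    return 12;
}
```

rax is call-clobbered anyway, but for a variadic callee the x86-64 System V ABI passes the number of vector registers used in al, so a scratch register like r11 (49 BB imm64, then 41 FF D3) avoids clobbering it.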
Normally you'd put a label address into a register with lea rax, [rel func], but if that RIP-relative lea can reach func, then so can call rel32, and you'd just use that.
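The reason is that RIP-relative lea and call rel32 both encode a signed 32-bit displacement measured from the end of the instruction. A sketch (hypothetical helper name) that encodes lea rax, [rel target] for a known instruction address and reports when the target is out of range:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: encode "lea rax, [rel target]" (7 bytes) for an
   instruction at insn_addr. Returns false if target is out of rel32 range. */
bool emit_lea_rax_rip(uint8_t out[7], uint64_t insn_addr, uint64_t target)
{
    /* Displacement is relative to the end of the 7-byte instruction. */
    int64_t disp = (int64_t)(target - (insn_addr + 7));
    if (disp < INT32_MIN || disp > INT32_MAX)
        return false;
    out[0] = 0x48;  /* REX.W */
    out[1] = 0x8D;  /* LEA r64, m */
    out[2] = 0x05;  /* ModRM: mod=00, reg=rax, rm=101 -> [RIP + disp32] */
    uint32_t d = (uint32_t)(int32_t)disp;
    for (int i = 0; i < 4; i++)  /* disp32, little-endian */
        out[3 + i] = (uint8_t)(d >> (8 * i));
    return true;
}
```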
Or, if you know at what address your machine code will be stored, you can use the normal direct call rel32 encoding, after calculating rel32 as the target address minus the address of the end of the call instruction.
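A minimal sketch of that calculation for the 5-byte E8 rel32 form, assuming insn_addr is where the bytes will actually execute (the address inside your code buffer, not a temporary staging copy). emit_call_rel32 is a hypothetical name:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: encode "call rel32" (E8 cd) for an instruction at
   insn_addr. Returns false if target is out of signed 32-bit range. */
bool emit_call_rel32(uint8_t out[5], uint64_t insn_addr, uint64_t target)
{
    /* rel32 is measured from the end of the 5-byte instruction. */
    int64_t rel = (int64_t)(target - (insn_addr + 5));
    if (rel < INT32_MIN || rel > INT32_MAX)
        return false;
    uint32_t r = (uint32_t)(int32_t)rel;
    out[0] = 0xE8;
    for (int i = 0; i < 4; i++)  /* rel32, little-endian */
        out[1 + i] = (uint8_t)(r >> (8 * i));
    return true;
}
```

If it returns false, the options are moving the buffer closer (MAP_32BIT, or an mmap hint address near the target) or falling back to an indirect call.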
If you don't want to use an indirect call, then the rel32 encoding is your only option. Make sure your machine code goes into the low 2GiB, so it can reach any target that is also in the low 2GiB.
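The reach can be checked with plain arithmetic: when both the end of the call and the target are below 2^31, their difference lies strictly between -2^31 and 2^31, so it always fits in rel32. A target near the top of the low 4GiB is not automatically reachable from code near address 0, though. A small sketch (rel32_reachable is a hypothetical name):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if a call rel32 whose instruction ends at
   insn_end can reach target. */
bool rel32_reachable(uint64_t insn_end, uint64_t target)
{
    int64_t rel = (int64_t)(target - insn_end);
    return rel >= INT32_MIN && rel <= INT32_MAX;
}
```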
if I could be guaranteed that the executable would be for sure mapped to the bottom 4GB of memory
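Yes for traditional (non-PIE) Linux executables. AMD64 direct call / jmp instructions and RIP-relative addressing only have rel32 encodings, so Linux, Windows, and OS X all default to the "small" code model, where code and static data fit within 2GiB of each other. That guarantees the linker can just fill in a rel32 to reach up to 2GiB forward or 2GiB backward. For a non-PIE Linux executable, that region is the low 2GiB. Position-independent executables (the default on many current Linux distros) and typical Windows and OS X executables are loaded above 4GiB, though, so for a MAP_32BIT buffer to reach them with rel32 you'd need to build with -no-pie (GCC/Clang) or allocate the buffer near the executable instead.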
The x86-64 System V ABI does discuss a Large code model, but I don't know if anyone ever uses it, because of the inefficiency of addressing data and making calls.
Re: efficiency: yes, mov / call rax is less efficient. I think it's significantly slower if branch prediction misses and can't provide a target prediction from the BTB. However, even call rel32 and jmp rel32 still need the BTB for full performance. See Slow jmp-instruction for experimental results from relative jmp next_insn slowing down when there are too many in a giant loop.
With hot branch predictors, the indirect version costs only extra code size and an extra uop (the mov). It might consume more prediction resources, but maybe not even that.
See also What branch misprediction does the Branch Target Buffer detect?