How does syscall know where to jump? [closed]

Tags: linux, assembly, x86-64, system-calls, nasm

How does Linux determine the address of another process to execute with a syscall? Like in this example?

mov rax, 59 
mov rdi, progName
syscall

It seems there is a bit of confusion about my question. To clarify, what I am asking is how syscall works, independently of the registers or arguments passed: how it knows where to jump, where to return, etc. when another process is called.

508

asked Jul 02 '19 14:07

Bryan Jimenez Chacon

1 Answer

syscall

The syscall instruction is really just an Intel/AMD CPU instruction. Here is the synopsis:

IF (CS.L ≠ 1 ) or (IA32_EFER.LMA ≠ 1) or (IA32_EFER.SCE ≠ 1)
  THEN #UD;
FI;
RCX ← RIP;
RIP ← IA32_LSTAR;
R11 ← RFLAGS;
RFLAGS ← RFLAGS AND NOT(IA32_FMASK);
CS.Selector ← IA32_STAR[47:32] AND FFFCH
CS.Base ← 0;
CS.Limit ← FFFFFH;
CS.Type ← 11;
CS.S ← 1;
CS.DPL ← 0;
CS.P ← 1;
CS.L ← 1;
CS.D ← 0;
CS.G ← 1;
CPL ← 0;
SS.Selector ← IA32_STAR[47:32] + 8;
SS.Base ← 0;
SS.Limit ← FFFFFH;
SS.Type ← 3;
SS.S ← 1;
SS.DPL ← 0;
SS.P ← 1;
SS.B ← 1;
SS.G ← 1;

The most important part is the pair of operations that save and replace the RIP register:

RCX ← RIP
RIP ← IA32_LSTAR

In other words, there must be code at the address stored in IA32_LSTAR (a model-specific register that the kernel sets up during boot), and RCX holds the return address.

The CS and SS segments are also changed so that the kernel code runs at privilege level 0 (ring 0, the most privileged level).

The #UD (invalid-opcode) exception is raised when the conditions on the first line are not met: the CPU is not in 64-bit mode, or the operating system has not enabled syscall through IA32_EFER.SCE.
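To see the RCX and R11 side effects from user space, here is a minimal sketch of a zero-argument system call made with the raw instruction. It assumes x86-64 Linux and a GCC- or Clang-compatible compiler; the helper name raw_syscall0 is made up for this example.

```c
#include <sys/syscall.h>   /* SYS_getpid */
#include <unistd.h>        /* getpid(), used only for comparison */

/* Hypothetical helper: issue a system call that takes no arguments.
 * x86-64 Linux only, GCC/Clang extended-asm syntax. */
static long raw_syscall0(long nr)
{
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)               /* RAX holds the return value on exit */
        : "a"(nr)                 /* RAX holds the system-call number on entry */
        : "rcx", "r11", "memory"  /* the CPU overwrites RCX (with RIP) and R11 (with RFLAGS) */
    );
    return ret;
}
```

Calling raw_syscall0(SYS_getpid) should give the same value as getpid(). The clobber list is the interesting part: the compiler has to be told that RCX and R11 do not survive the instruction, which follows directly from the RCX ← RIP and R11 ← RFLAGS steps in the synopsis.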

How is `RAX` interpreted?

This is just an index into a table of kernel function pointers. First the kernel does a bounds-check (and returns -ENOSYS if RAX > __NR_syscall_max), then dispatches to (C syntax) sys_call_table[rax](rdi, rsi, rdx, r10, r8, r9);

; Intel-syntax translation of Linux 4.12 syscall entry point
       ...                 ; save user-space registers etc.
    call   [sys_call_table + rax * 8]       ; dispatch to sys_execve() or whatever kernel C function

;;; execve probably won't return via this path, but most other calls will
       ...                 ; restore registers except RAX return value, and return to user-space

Modern Linux is more complicated in practice because of mitigations for x86 vulnerabilities like Meltdown and L1TF: the kernel changes the page tables so that most of kernel memory isn't mapped while user-space is running. The code above is a literal translation (from AT&T syntax) of call *sys_call_table(, %rax, 8) from ENTRY(entry_SYSCALL_64) in Linux 4.12 arch/x86/entry/entry_64.S (before the Spectre/Meltdown mitigations were added). Also related: "What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?" has some more details about the kernel side of system-call dispatching.
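The bounds-check-then-dispatch idea can be sketched in plain user-space C. This is only a toy model, not kernel code: the table, the two handlers, and every demo_ name are hypothetical stand-ins for sys_call_table and functions like sys_execve().

```c
#include <errno.h>    /* ENOSYS */

/* Every handler takes six register-sized arguments, like the real ABI. */
typedef long (*syscall_fn)(long, long, long, long, long, long);

/* Hypothetical handlers standing in for kernel C functions. */
static long demo_add(long a, long b, long c, long d, long e, long f)
{
    (void)c; (void)d; (void)e; (void)f;   /* unused arguments */
    return a + b;
}

static long demo_negate(long a, long b, long c, long d, long e, long f)
{
    (void)b; (void)c; (void)d; (void)e; (void)f;
    return -a;
}

/* The call number is simply an index into this array. */
static const syscall_fn demo_call_table[] = { demo_add, demo_negate };
static const unsigned long demo_nr_max =
    sizeof demo_call_table / sizeof demo_call_table[0] - 1;

/* Bounds-check the number first, then call through the table. */
static long demo_dispatch(unsigned long nr, long a1, long a2, long a3,
                          long a4, long a5, long a6)
{
    if (nr > demo_nr_max)
        return -ENOSYS;                   /* unknown call number */
    return demo_call_table[nr](a1, a2, a3, a4, a5, a6);
}
```

With this table, demo_dispatch(0, 2, 3, 0, 0, 0, 0) calls demo_add and returns 5, while any number above 1 returns -ENOSYS without touching the table.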

Fast?

The instruction is said to be fast. In the old days, one had to use a software-interrupt instruction such as int 0x80. An interrupt switches to the kernel stack, pushes several registers onto it, and returns with the rather slow iret instruction, which leaves the exception state and resumes at the address just after the interrupt instruction. This is generally much slower.

With syscall you can avoid most of that overhead. For what you're asking, however, this is not really going to help.

Another instruction used alongside syscall is swapgs. This gives the kernel a way to access its own data and stack. You should look at the Intel/AMD documentation about those instructions for more details.

New Process?

The Linux system has what it calls a task table. Each process and each thread within a process is actually called a task.

When you create a new process, Linux creates a task. For that to work, it runs code that does things such as:

Make sure the executable exists
Set up a new task (including parsing the ELF program headers from that executable to create memory mappings in the newly-created virtual address space.)
Allocate a stack buffer
Load the first few blocks of the executable (as an optimization for demand paging), allocating some physical pages for the virtual pages to map to.
Set up the start address in the task (the ELF entry point from the executable)
Mark the task as ready (a.k.a. runnable)

This is, of course, super simplified.

The start address is defined in your ELF binary. The kernel really only needs to determine that one address, save it as the task's current RIP, and "return" to user-space. The normal demand-paging mechanism takes care of the rest: if the code is not yet loaded, the first access generates a #PF page-fault exception and the kernel loads the necessary code at that point. In most cases, though, the loader will already have loaded part of the program as an optimization to avoid that initial page fault.

(A #PF on a page that isn't mapped would result in the kernel delivering a SIGSEGV segfault signal to your process, but a "valid" page fault is handled silently by the kernel.)
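The start address discussed above is just a field in the ELF header, so it is easy to look at from user space. Here is a minimal sketch that reads it with the structures from <elf.h>. It assumes a 64-bit ELF file with the same byte order as the host, and the function name elf64_entry_point is made up for this example.

```c
#include <elf.h>      /* Elf64_Ehdr, ELFMAG, SELFMAG, EI_CLASS, ELFCLASS64 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the e_entry field of a 64-bit ELF file,
 * or 0 if the file cannot be read or is not a 64-bit ELF file. */
static uint64_t elf64_entry_point(const char *path)
{
    Elf64_Ehdr hdr;
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return 0;

    size_t got = fread(&hdr, 1, sizeof hdr, f);
    fclose(f);
    if (got != sizeof hdr)
        return 0;                              /* file too short */

    if (memcmp(hdr.e_ident, ELFMAG, SELFMAG) != 0)
        return 0;                              /* missing "\x7fELF" magic */
    if (hdr.e_ident[EI_CLASS] != ELFCLASS64)
        return 0;                              /* not a 64-bit ELF file */

    return hdr.e_entry;                        /* virtual address of the first instruction */
}
```

On Linux, elf64_entry_point("/proc/self/exe") reports the entry point of the running program. For a position-independent executable, the value is relative to wherever the kernel decides to load the image, so it will look like a small offset rather than an address near 0x400000.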

All new processes usually get loaded at the same virtual address (ignoring PIE + ASLR). This is possible because we use the MMU (Memory Management Unit). That coprocessor translates memory addresses between virtual address spaces and physical address space.

(Editor's note: the MMU isn't really a coprocessor; in modern CPUs virtual memory logic is tightly integrated into each core, alongside the L1 instruction/data caches. Some ancient CPUs did use an external MMU chip, though.)
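A toy model may help to show how two processes can use the same virtual address without colliding. The sketch below uses a single flat table per address space, whereas real x86-64 hardware walks multi-level page tables; all the toy_ names, the four-page size, and the frame numbers are assumptions made for illustration.

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT 12u                       /* 4 KiB pages */
#define TOY_PAGE_SIZE  (1ull << TOY_PAGE_SHIFT)
#define TOY_PAGES      4u                        /* hypothetical: 4 pages per address space */
#define TOY_BASE       0x400000ull               /* virtual address where the image starts */
#define TOY_FAULT      UINT64_MAX                /* stands in for a #PF */

/* One table per process: physical frame number for each virtual page.
 * A frame number of 0 means "not mapped" in this toy model. */
struct toy_address_space {
    uint64_t frame[TOY_PAGES];
};

/* Translate a virtual address to a physical address, or TOY_FAULT. */
static uint64_t toy_translate(const struct toy_address_space *as, uint64_t vaddr)
{
    if (vaddr < TOY_BASE)
        return TOY_FAULT;

    uint64_t page = (vaddr - TOY_BASE) >> TOY_PAGE_SHIFT;
    if (page >= TOY_PAGES || as->frame[page] == 0)
        return TOY_FAULT;                        /* no mapping for this page */

    /* Physical address = frame base + offset within the page. */
    return (as->frame[page] << TOY_PAGE_SHIFT) | (vaddr & (TOY_PAGE_SIZE - 1));
}
```

Give one address space frame 0x100 for its first page and another one frame 0x200: the same virtual address 0x400010 then translates to 0x100010 in the first and 0x200010 in the second, which is why both processes can believe they live at 0x400000.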

Determine the Address?

So, now we understand that all processes are loaded at the same virtual address (0x400000 under Linux is the default chosen by ld). To determine the real physical address we use the MMU. How does the kernel decide on that physical address? Well, it has a memory allocation function. It's that simple.

It calls a "malloc()" type of function, which searches for a memory block that is not currently in use and creates (a.k.a. loads) the process at that location. If no memory block is available, the kernel tries to swap something out of memory. If that fails, the creation of the process fails.

When creating a process, it allocates fairly large blocks of memory to start with. It is not unusual to allocate 1 MB or 2 MB buffers to start a new process. This makes things go a lot faster.

Also, if the program is already running and you start it again, a lot of the memory used by the running instance can be reused. In that case the kernel does not allocate or load those parts again. It uses the MMU to share the pages that can be common to both instances. In most cases the code part of the process can be shared since it is read-only, and some of the data can be shared when it is also marked read-only. Data that is not marked read-only can still be shared as long as it hasn't been modified yet; in that case it is marked copy-on-write.
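Copy-on-write is easiest to observe with fork() rather than by starting a program twice, so the sketch below swaps in that scenario: parent and child start out sharing the same pages, and the child only gets a private copy of the page once it writes to it. The variable and function names are made up for this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_value = 1;   /* lives at the same virtual address in parent and child */

/* Fork, let the child overwrite shared_value, and return the value the
 * parent still sees afterwards. Returns -1 if fork() or waitpid() fails. */
static int parent_value_after_child_write(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: this write lands in the child's own copy of the page. */
        shared_value = 42;
        _exit(shared_value == 42 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return shared_value;       /* the parent's page was never modified */
}
```

The parent still reads 1 after the child has stored 42 at the very same virtual address, because the two addresses no longer map to the same physical page after the write.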

179

answered Oct 06 '22 00:10

Alexis Wilke
