Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How syscall knows where to jump? [closed]

How does Linux determine the address of another process to execute with a syscall? Like in this example?

mov rax, 59 
mov rdi, progName
syscall

It seems there is a bit of confusion with my question, to clarify, what I was asking is how does syscall works, independently of the registers or arguments passed. How it knows where to jump, return etc when an other process is called.

like image 508
Bryan Jimenez Chacon Avatar asked Jul 02 '19 14:07

Bryan Jimenez Chacon


People also ask

How does the syscall instruction work?

The syscall instruction transfers control to the operating system which then performs the requested service. Then control (usually) returns to the program. (This description leaves out many details). The syscall instruction causes an exception, which transfers control to an exception handler.

How does the kernel know which system call was invoked?

The kernel doesn't monitor the process to detect a system call. Instead, the process generates an interrupt which transfers control to the kernel, because that's what software-generated interrupts do according to the instruction set reference manual.

How does Linux syscall work?

A system call is a function that allows a process to communicate with the Linux kernel. It's just a programmatic way for a computer program to order a facility from the operating system's kernel. System calls expose the operating system's resources to user programs through an API (Application Programming Interface).

How is a syscall executed?

System call is made by sending a trap signal to the kernel, which reads the system call code from the register and executes the system call. Major type of sytem calls are Process Control, File Management, Device Management, Information maintenance and Communicaiton.


1 Answers

syscall

The syscall instruction is really just an INTEL/AMD CPU instruction. Here is the synopsis:

IF (CS.L ≠ 1 ) or (IA32_EFER.LMA ≠ 1) or (IA32_EFER.SCE ≠ 1)
  THEN #UD;
FI;
RCX ← RIP;
RIP ← IA32_LSTAR;
R11 ← RFLAGS;
RFLAGS ← RFLAGS AND NOT(IA32_FMASK);
CS.Selector ← IA32_STAR[47:32] AND FFFCH
CS.Base ← 0;
CS.Limit ← FFFFFH;
CS.Type ← 11;
CS.S ← 1;
CS.DPL ← 0;
CS.P ← 1;
CS.L ← 1;
CS.D ← 0;
CS.G ← 1;
CPL ← 0;
SS.Selector ← IA32_STAR[47:32] + 8;
SS.Base ← 0;
SS.Limit ← FFFFFH;
SS.Type ← 3;
SS.S ← 1;
SS.DPL ← 0;
SS.P ← 1;
SS.B ← 1;
SS.G ← 1;

The most important part are the two instructions that save and manage the RIP register:

RCX ← RIP
RIP ← IA32_LSTAR

So in other words, there must be code at the address saved in IA32_LSTAR (a register) and RCX is the return address.

The CS and SS segments are also tweaked so your kernel code will be able to further run at CPU Level 0 (a privileged level.)

The #UD may happen if you do not have the right to execute syscall or if the instruction doesn't exist.

How is RAX interpreted?

This is just an index into a table of kernel function pointers. First the kernel does a bounds-check (and returns -ENOSYS if RAX > __NR_syscall_max), then dispatches to (C syntax) sys_call_table[rax](rdi, rsi, rdx, r10, r8, r9);

; Intel-syntax translation of Linux 4.12 syscall entry point
       ...                 ; save user-space registers etc.
    call   [sys_call_table + rax * 8]       ; dispatch to sys_execve() or whatever kernel C function

;;; execve probably won't return via this path, but most other calls will
       ...                 ; restore registers except RAX return value, and return to user-space

Modern Linux is more complicated in practice because of workarounds for x86 vulnerabilities like Meltdown and L1TF by changing the page tables so most of kernel memory isn't mapped while user-space is running. The above code is a literal translation (from AT&T syntax) of call *sys_call_table(, %rax, 8) from ENTRY(entry_SYSCALL_64) in Linux 4.12 arch/x86/entry/entry_64.S (before Spectre/Meltdown mitigations were added). Also related: What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? has some more details about the kernel side of system-call dispatching.

Fast?

The instruction is said to be fast. This is because in the old days one would have to use an instruction such as INT3. The interrupts make use of the kernel stack, it pushes many registers on the stack and uses the rather slow RTE to exit the exception state and return to the address just after the interrupt. This is generally much slower.

With the syscall you may be able to avoid most of that overhead. However, in what you're asking, this is not really going to help.

Another instruction which is used along syscall is swapgs. This gives the kernel a way to access its own data and stack. You should look at the Intel/AMD documentation about those instructions for more details.

New Process?

The Linux system has what it calls a task table. Each process and each thread within a process is actually called a task.

When you create a new process, Linux creates a task. For that to work, it runs codes which does things such as:

  • Make sure the executable exists
  • Setup a new task (including parsing the ELF program headers from that executable to create memory mappings in the newly-created virtual address space.)
  • Allocates a stack buffer
  • Load the first few blocks of the executable (as an optimization for demand paging), allocating some physical pages for the virtual pages to map to.
  • Setup the start address in the task (ELF entry point from the executable)
  • Mark the task as ready (a.k.a. running)

This is, of course, super simplified.

The start address is defined in your ELF binary. It really only needs to determine that one address and save it in the task current RIP pointer and "return" to user-space. The normal demand-paging mechanism will take care of the rest: if the code is not yet loaded, it will generate a #PF page-fault exception and the kernel will load the necessary code at that point. Although in most cases the loader will already have some part of the software loaded as an optimization to avoid that initial page-fault.

(A #PF on a page that isn't mapped would result in the kernel delivering a SIGSEGV segfault signal to your process, but a "valid" page fault is handled silently by the kernel.)

All new processes usually get loaded at the same virtual address (ignoring PIE + ASLR). This is possible because we use the MMU (Memory Management Unit). That coprocessor translates memory addresses between virtual address spaces and physical address space.

(Editor's note: the MMU isn't really a coprocessor; in modern CPUs virtual memory logic is tightly integrated into each core, along side the L1 instruction/data caches. Some ancient CPUs did use an external MMU chip, though.)

Determine the Address?

So, now we understand that all processes have the same virtual address (0x400000 under Linux is the default chosen by ld). To determine the real physical address we use the MMU. How does the kernel decide of that physical address? Well, it has a memory allocation function. That simple.

It calls a "malloc()" type of function which searches for a memory block which is not currently used and creates (a.k.a. loads) the process at that location. If no memory block is currently available, the kernel checks for swapping something out of memory. If that fails, the creation of the process fails.

In case of a process creation, it will allocate pretty large blocks of memory to start with. It is not unusual to allocate 1Mb or 2Mb buffers to start a new process. This makes things go a lot faster.

Also, if the process is already running and you starting it again, a lot of the memory used by the already running instance can be reused. In that case the kernel does not allocate/load those parts. It will use the MMU to share those pages that can be made common to both instances of the process (i.e. in most cases the code part of the process can be shared since it is read-only, some part of the data can be shared when it is also marked as read-only; if not marked read-only, the data can still be shared if it wasn't modified yet--in this case it's marked as copy on write.)

like image 179
Alexis Wilke Avatar answered Oct 06 '22 00:10

Alexis Wilke