 

User space vs. kernel space program performance difference

I have a sequential user space program (some kind of memory-intensive search data structure). The program's performance, measured in CPU cycles, depends on the memory layout of the underlying data structures and on the data cache size (last-level cache, LLC).

So far my user space program is tuned to death; now I am wondering if I can get a performance gain by moving the user space code into the kernel (as a kernel module). I can think of the following factors that might improve performance in kernel space ...

  1. No system call overhead (how many CPU cycles are gained per system call?). This is less critical, as I barely make any system calls in my program except for allocating memory, and that only when the program starts.
  2. Control over scheduling: I can create a kernel thread and make it run on a given core without being migrated away.
  3. I can allocate memory with kmalloc and thus have more control over the allocated memory; I may also be able to control cache coloring more precisely by controlling what memory is allocated. Is it worth trying?

My questions to the kernel experts...

  • Have I missed any factors in the above list that can improve performance further?
  • Is it worth trying, or is it already known that I will NOT get much performance improvement?
  • If a performance gain is possible in the kernel, is there any estimate of how large it could be (any theoretical guess)?

Thanks.

asked Jun 30 '12 by Nitin Kunal



2 Answers

Regarding point 1: kernel threads can still be preempted, so unless you're making lots of syscalls (which you aren't) this won't buy you much.

Regarding point 2: you can pin a thread to a specific core by setting its affinity, using sched_setaffinity() on Linux.
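
For example, a minimal sketch of pinning the calling thread to one core from user space (the helper name and the core index are just for illustration):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread (pid 0) to the given CPU. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

/* Usage: pin_to_cpu(2); */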

Regarding point 3: What extra control are you expecting? You can already allocate page-aligned memory from user space using mmap(). This already lets you control for the cache's set associativity, and you can use inline assembly or compiler intrinsics for any manual prefetching hints or non-temporal writes. The main difference between memory allocated in the kernel and in user space is that kmalloc() allocates wired (non-pageable) memory. I don't see how this would help.
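
For instance, a sketch of allocating page-aligned anonymous memory with mmap() plus a software prefetch hint (the helper name and the usage size are just for illustration):

#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Allocate len bytes of page-aligned, anonymous, zero-filled memory. */
static void *alloc_pages(size_t len)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }
    /* GCC/Clang builtin: hint a read prefetch with high temporal locality. */
    __builtin_prefetch(buf, 0, 3);
    return buf;
}

/* Usage: void *table = alloc_pages(1 << 20);  release with munmap(table, 1 << 20); */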

I suspect you'll see a much better return on investment from parallelising with SIMD or multithreading, or from further algorithmic or memory optimisations.

answered Sep 24 '22 by pmdj


Create a dedicated cpuset for your program and move all other processes out of it. Then bump your process's priority to realtime with the FIFO scheduling policy, using something like:

#include <sched.h>
#include <stdio.h>

struct sched_param schedparams;
// Be portable - don't just set priority to 99 :)
schedparams.sched_priority = sched_get_priority_max(SCHED_FIFO);
if (sched_setscheduler(0, SCHED_FIFO, &schedparams) == -1)
    perror("sched_setscheduler");

Don't do that on a single-core system!

Reserve a large enough stack with alloca(3) and touch all of the allocated stack memory, map more than enough heap space, and then use mlock(2) or mlockall(2) to pin the process memory.
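
A sketch of the stack-touching and locking part (the helper name is illustrative, and the amount of stack you pre-fault must stay well below your actual stack limit):

#include <alloca.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Pre-fault some stack pages, then lock all current and future memory. */
static void prefault_and_lock(size_t stack_sz)
{
    volatile char *stack = alloca(stack_sz);
    for (size_t i = 0; i < stack_sz; i += 4096)
        stack[i] = 0;                      /* touch each page once */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
        perror("mlockall");
}

/* Usage: prefault_and_lock(1 << 20); */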

Even though your program is sequential, NUMA effects can slow it down if it runs on a multisocket Nehalem or post-Nehalem Intel system, or on a multisocket AMD64 system. Use API functions from numa(3) to allocate memory and keep it as close as possible to the NUMA node where your program executes.
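
A sketch using libnuma (numa.h, link with -lnuma; the helper name is just for illustration):

#include <numa.h>
#include <stddef.h>
#include <stdio.h>

/* Allocate len bytes on the NUMA node the calling thread currently runs on. */
static void *numa_local_alloc(size_t len)
{
    if (numa_available() == -1) {
        fprintf(stderr, "libnuma: NUMA not supported on this system\n");
        return NULL;
    }
    return numa_alloc_local(len);          /* free later with numa_free(ptr, len) */
}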

Try other compilers - some of them might optimise better than the one you are currently using. Intel's compiler, for example, is very aggressive about laying out instructions to benefit from out-of-order execution, pipelining and branch prediction.

answered Sep 21 '22 by Hristo Iliev