I have a sequential user-space program (a memory-intensive search data structure). Its performance, measured in CPU cycles, depends on the memory layout of the underlying data structures and on the size of the last-level data cache (LLC).
So far my user-space program has been tuned to death; now I am wondering whether I can gain performance by moving the code into the kernel (as a kernel module). I can think of the following factors that might improve performance in kernel space ...
My questions to the kernel experts...
Thanks.
In general, code that runs in kernel space runs at the same speed as code in user space.
Kernel modules have higher execution privilege: code that runs in kernel space has greater privilege than code that runs in user space, so driver modules can have a much greater impact on the system than user programs.
The kernel provides CPU scheduling, memory management, file management, and other operating system functions through system calls. Because these services are implemented in a single address space, operating system execution is faster.
Kernel to user-space protection: the system memory of Linux is divided into two areas, kernel space and user space. This separation provides memory protection and hardware protection from malicious or errant software behaviour.
Regarding point 1: kernel threads can still be preempted, so unless you're making lots of syscalls (which you aren't) this won't buy you much.
Regarding point 2: you can pin a thread to a specific core by setting its affinity, using sched_setaffinity() on Linux.
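As a rough sketch of the affinity call mentioned above (the helper names pin_to_cpu and first_allowed_cpu are made up for illustration, and the code assumes Linux with glibc):

```c
#define _GNU_SOURCE            /* glibc exposes the CPU_* macros only with this */
#include <sched.h>

/* Hypothetical helper: return the lowest-numbered CPU the calling thread
 * is currently allowed to run on, or -1 on error. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Hypothetical helper: restrict the calling thread to a single CPU.
 * Returns 0 on success, -1 on failure (errno is set by the kernel
 * when the system call itself fails). */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* A pid of 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Calling pin_to_cpu(first_allowed_cpu()) early in main() should keep the thread on one core. Which core is best depends on the machine, so treat the choice here as a placeholder.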
Regarding point 3: what extra control are you expecting? You can already allocate page-aligned memory from user space using mmap(). This already lets you control for the cache's set associativity, and you can use inline assembly or compiler intrinsics for any manual prefetching hints or non-temporal writes. The main difference between memory allocated in the kernel and in user space is that kmalloc() allocates wired (non-pageable) memory. I don't see how this would help.
I suspect you'll see much better ROI on parallelising using SIMD, multithreading or making further algorithmic or memory optimisations.
Create a dedicated cpuset for your program and move all other processes out of it. Then bump your process's priority to realtime with the FIFO scheduling policy using something like:
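#include <sched.h>
#include <stdio.h>
struct sched_param schedparams = {0};
// Be portable - don't just set priority to 99 :)
schedparams.sched_priority = sched_get_priority_max(SCHED_FIFO);
// Needs root or CAP_SYS_NICE; returns -1 and sets errno on failure
if (sched_setscheduler(0, SCHED_FIFO, &schedparams) == -1)
    perror("sched_setscheduler");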
Don't do that on a single-core system! A CPU-bound SCHED_FIFO process can starve everything else on that core, including your shell.
Reserve large enough stack space with alloca(3) and touch all of the allocated stack memory, map more than enough heap space, and then use mlock(2) or mlockall(2) to pin process memory.
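A minimal sketch of that sequence, assuming Linux (prefault_stack and lock_all_memory are hypothetical names, and the amount of stack to reserve is up to you):

```c
#include <alloca.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: grow the stack by `bytes` and write to one byte in
 * every page so the kernel actually backs it with physical memory.
 * The alloca() space is released on return, but the stack pages stay
 * mapped. Returns the number of bytes reserved. */
size_t prefault_stack(size_t bytes)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    volatile char *p = (volatile char *)alloca(bytes);
    for (size_t i = 0; i < bytes; i += page)
        p[i] = 0;   /* volatile keeps the compiler from dropping the store */
    return bytes;
}

/* Hypothetical helper: lock everything mapped now and everything mapped
 * later into RAM. Returns 0 on success, -1 on failure with errno set. */
int lock_all_memory(void)
{
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}
```

Call prefault_stack() first and lock_all_memory() second, so the touched stack pages are already resident when they get locked. mlockall() can fail with EPERM or ENOMEM if the process lacks CAP_IPC_LOCK and RLIMIT_MEMLOCK is too low, so check the return value.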
Even if your program is a sequential one, if run on a multisocket Nehalem or post-Nehalem Intel system or an AMD64 system, NUMA effects can slow your program down. Use API functions from numa(3) to allocate and keep memory as close to the NUMA node where your program executes as possible.
Try other compilers; some of them might optimise better than the one you are currently using. Intel's compiler, for example, is very aggressive about laying out instructions so as to benefit from out-of-order execution, pipelining and branch prediction.