User space vs. kernel space program performance difference

Tags: performance, caching, cache-control, kernel

I have a sequential user-space program (a memory-intensive search data structure). Its performance, measured in CPU cycles, depends on the memory layout of the underlying data structures and on the size of the last-level data cache (LLC).

My user-space program is already tuned to death, so I am now wondering whether I can gain performance by moving the code into the kernel (as a kernel module). I can think of the following factors that might improve performance in kernel space:

1. No system call overhead (how many CPU cycles are gained per system call?). This is less critical, as my program makes barely any system calls apart from allocating memory, and that only at startup.
2. Control over scheduling: I can create a kernel thread and make it run on a given core without being descheduled.
3. I can use kmalloc for memory allocation and thus have more control over the allocated memory; I may also be able to control cache coloring more precisely. Is it worth trying?

My questions to the kernel experts:

- Have I missed any factors in the list above that could improve performance further?
- Is it worth trying, or is it already known that I will NOT get much of an improvement?
- If a performance gain is possible in the kernel, is there any estimate (even a theoretical guess) of how large it could be?

Thanks.

asked Jun 30 '12 07:06

Nitin Kunal

2 Answers

Regarding point 1: kernel threads can still be preempted, so unless you're making lots of syscalls (which you aren't) this won't buy you much.

Regarding point 2: you can pin a thread to a specific core by setting its affinity, using sched_setaffinity() on Linux.
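As a rough illustration of the affinity approach, here is a minimal Linux-only sketch. The helper names `pin_to_cpu` and `first_allowed_cpu` are made up for this example; only `sched_setaffinity()`, `sched_getaffinity()` and the `CPU_*` macros come from glibc.

```c
#define _GNU_SOURCE   /* needed for the CPU_* macros and sched_setaffinity() */
#include <sched.h>

/* Hypothetical helper: pin the calling thread to a single CPU.
 * Returns 0 on success, -1 on error (errno is set by the kernel). */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Hypothetical helper: lowest-numbered CPU the calling thread is
 * currently allowed to run on, or -1 if the mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) == -1)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

Calling `pin_to_cpu(first_allowed_cpu())` early in `main()` keeps the thread on one core. Note that affinity alone does not stop other tasks from being scheduled on that core as well.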

Regarding point 3: What extra control are you expecting? You can already allocate page-aligned memory from user space using mmap(). This already lets you control for the cache's set associativity, and you can use inline assembly or compiler intrinsics for any manual prefetching hints or non-temporal writes. The main difference between memory allocated in the kernel and in user space is that kmalloc() allocates wired (non-pageable) memory. I don't see how this would help.

I suspect you'll see much better ROI on parallelising using SIMD, multithreading or making further algorithmic or memory optimisations.

answered Sep 24 '22 00:09

pmdj

Create a dedicated cpuset for your program and move all other processes out of it. Then bump your process' priority to realtime with FIFO scheduling policy using something like:

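struct sched_param schedparams;   // declared in <sched.h>
// Be portable - don't just set priority to 99 :)
schedparams.sched_priority = sched_get_priority_max(SCHED_FIFO);
// Returns -1 without sufficient privilege (e.g. CAP_SYS_NICE on Linux)
if (sched_setscheduler(0, SCHED_FIFO, &schedparams) == -1)
    perror("sched_setscheduler");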

Don't do that on a single-core system!

Reserve enough stack space with alloca(3) and touch all of the allocated stack memory, map more than enough heap space, and then use mlock(2) or mlockall(2) to pin the process's memory.
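A minimal sketch of the pre-faulting and locking step might look like the following. It swaps `alloca(3)` for a fixed-size `volatile` array so the compiler cannot drop the writes. `STACK_RESERVE`, `PAGE_STEP` and both function names are assumptions made up for this example, and the 4096-byte step assumes the common 4 KiB page size.

```c
#include <stddef.h>
#include <sys/mman.h>

#define STACK_RESERVE (256 * 1024) /* assumed figure; size it for your program */
#define PAGE_STEP     4096         /* assumed page size */

/* Hypothetical helper: write one byte per page of a large stack buffer
 * so those stack pages are mapped before the hot loop runs.
 * Returns the number of pages touched. */
size_t prefault_stack(void)
{
    volatile unsigned char buf[STACK_RESERVE];
    size_t touched = 0;
    for (size_t off = 0; off < sizeof(buf); off += PAGE_STEP) {
        buf[off] = 0;
        touched++;
    }
    return touched;
}

/* Hypothetical helper: lock all current and future pages into RAM.
 * Returns 0 on success, -1 on error, for instance when RLIMIT_MEMLOCK
 * is too low for an unprivileged process. */
int lock_memory(void)
{
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}
```

Call `prefault_stack()` first and `lock_memory()` afterwards, both before entering the performance-critical code, and check the return value of the latter since locking often fails without raised limits.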

Even if your program is sequential, NUMA effects can slow it down when it runs on a multisocket Nehalem or post-Nehalem Intel system or on an AMD64 system. Use the API functions from numa(3) to allocate memory on, and keep it as close as possible to, the NUMA node where your program executes.

Try other compilers - some of them might optimise better than the one you are currently using. Intel's compiler, for example, is very aggressive about laying out instructions so as to benefit from out-of-order execution, pipelining and branch prediction.

answered Sep 21 '22 00:09

Hristo Iliev
