I'm trying to determine the granularity I can accurately schedule tasks to occur in C/C++. At the moment I can reliably schedule tasks to occur every 5 microseconds, but I'm trying to see if I can lower this further.
Any advice on how to achieve this / if it is possible would be greatly appreciated.
Since I know timer granularity can often be OS dependent: I am currently running on Linux, but would use Windows if the timing granularity is better (although I don't believe it is, based on what I've found for the QueryPerformanceCounter)
I execute all measurements on bare-metal (no VM). /proc/timer_info
confirms nanosecond timer resolution for my CPU (but I know that doesn't translate to nanosecond alarm resolution)
Current
My current code can be found as a Gist here
At the moment, I'm able to execute a request every 5 microseconds (5000 nanoseconds) with less then 1% late arrivals. When late arrivals do occur, they are typically only one cycle (5000 nanoseconds) behind.
I'm doing 3 things at the moment
Setting the process to real-time priority (some pointed out by @Spudd86 here)
struct sched_param schedparm;
memset(&schedparm, 0, sizeof(schedparm));
schedparm.sched_priority = 99; // highest rt priority
sched_setscheduler(0, SCHED_FIFO, &schedparm);
Minimizing the timer slack
prctl(PR_SET_TIMERSLACK, 1);
Using timerfds (part of the 2.6 Linux kernel)
int timerfd = timerfd_create(CLOCK_MONOTONIC,0);
struct itimerspec timspec;
bzero(&timspec, sizeof(timspec));
timspec.it_interval.tv_sec = 0;
timspec.it_interval.tv_nsec = nanosecondInterval;
timspec.it_value.tv_sec = 0;
timspec.it_value.tv_nsec = 1;
timerfd_settime(timerfd, 0, &timspec, 0);
Possible improvements
Why
I'm currently working on creating a workload generator for a benchmarking engine. The workload generator simulates an arrival rate (X requests / second, etc.) using a Poisson process. From the Poisson process, I can determine the relative times at which requests must be made from the benchmarking engine.
So for instance, at 10 requests a second, we may have requests made at: t = 0.02, 0.04, 0.05, 0.056, 0.09 seconds
These requests need to be scheduled in advance and then executed. As the number of requests per second increases, the granularity required for scheduling these requests increases (thousands of requests per second requires sub-millisecond accuracy). As a result, I'm trying to figure out how to scale this system further.
Enabling CONFIG_RT_GROUP_SCHED lets you explicitly allocate real CPU bandwidth to task groups. This uses the cgroup virtual file system and "<cgroup>/cpu. rt_runtime_us" to control the CPU time reserved for each control group.
SCHED_OTHER is the default universal time-sharing scheduler policy used by most processes; SCHED_FIFO and SCHED_RR are intended for special time-critical applications that need precise control over the way in which runnable processes are selected for execution.
The LINUX Kernel used the O(n) scheduler between version 2.4 and 2.6. n is the number of runnable processes in the system. O(n) scheduler divides the processor's time into a unit called epochs. Each task is allowed to use at max 1 epoch.
The task scheduler, sometimes called process scheduler, is the part of the kernel that decides which task to run next. It is responsible for best using system resources to guarantee that multiple tasks are being executed simultaneously. This makes it a core component of any multitasking operating system.
You're very close to the limits of what vanilla Linux will offer you, and it's way past what it can guarantee. Adding the real-time patches to your kernel and tuning for full pre-emption will help give you better guarantees under load. I would also remove any dynamic memory allocation from your time critical code, malloc and friends can (and will) stall for a not-inconsequential (in a real-time sense) period of time if it has to reclaim the memory from the i/o cache. I would also be considering removing swap from that machine to help guarantee performance. Dedicating a processor to your task will help to prevent context switch times but, again, it's no guarantee.
I would also suggest that you be careful with that level of sched_priority, you're above various important bits of Linux there, which can lead to very strange effects.
What you gain from building a realtime kernel is more reliable guarantees (ie lower maximum latency) of the time between an IO/timer event handled by the kernel, and control being passed to your app in response. This comes at the price of lower throughput, and you might notice an increase in your best-case latency times.
However, the only reason for using OS timers to schedule events with high-precision is if you're afraid of burning CPU cycles in a loop while you wait for your next due event. OS timers (especially in MS Windows) are not reliable for high granularity timing events, and are very dependant on the sort of timing/HPET hardware available in your system.
When I require highly accurate event scheduling, I use a hybrid method. First, I measure the worst case latency - that is, the biggest difference between the time I requested to sleep, and the actual clock time after sleeping. Let's call this difference "D". (You can actually do this on-the-fly during normal running, by tracking "D" every time you sleep, with something like "D = (D*7 + lastD) / 8" to produce a temporal average).
Then never request to sleep beyond "N - D*2", where "N" is the time of the next event. When within "D*2" time of the next event, enter a spin loop and wait for "N" to occur.
This eats a lot more CPU cycles, but depending on the accuracy you require, you might be able to get away with a "sched_yield()" in your spin loop, which is more kind to your system.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With