
Why does sleeping increase the execution time of an independent piece of code that is executed before/afterwards?

I noticed something really strange that I haven't seen before. The basic setup is described in this pseudo code:

TARGET_LOOP_TIME = X

loop forever:
    before = now()
    payload()
    payload_time = now() - before
    sleep(TARGET_LOOP_TIME - payload_time)

This setup is fairly common, e.g. for keeping a loop at 60 FPS. The interesting part is: the payload_time depends on the sleep duration! If TARGET_LOOP_TIME is high and the program thus sleeps a lot, the payload_time is considerably higher than when the program does not sleep at all.

To measure this, I wrote this program:

use std::time::{Duration, Instant};

const ITERS: usize = 100;

fn main() {
    // A dummy variable to prevent the compiler from removing the dummy prime
    // code.
    let mut x = 0;

    // Iterate over different target loop times
    for loop_time in (1..30).map(|n| Duration::from_millis(n)) {
        let mut payload_duration = Duration::from_millis(0);

        for _ in 0..ITERS {
            let before = Instant::now();
            x += count_primes(3_500);
            let elapsed = before.elapsed();
            payload_duration += elapsed;

            // Sleep the remaining time
            if loop_time > elapsed {
                std::thread::sleep(loop_time - elapsed);
            }
        }

        let avg_duration = payload_duration / ITERS as u32;
        println!("loop_time {:.2?}  \t=> {:.2?}", loop_time, avg_duration);
    }

    println!("{}", x);
}

/// Dummy function.
fn count_primes(up_to: u64) -> u64 {
    (2..up_to)
        .filter(|n| (2..n / 2).all(|d| n % d != 0))
        .count() as u64
}

I iterate over different target loop times (1 ms to 29 ms) and run the payload ITERS times for each. I compiled and ran this with cargo run --release. On my machine (Ubuntu), the program outputs:

loop_time 1.00ms    => 3.37ms
loop_time 2.00ms    => 3.38ms
loop_time 3.00ms    => 3.17ms
loop_time 4.00ms    => 3.25ms
loop_time 5.00ms    => 3.38ms
loop_time 6.00ms    => 4.05ms
loop_time 7.00ms    => 4.09ms
loop_time 8.00ms    => 4.48ms
loop_time 9.00ms    => 4.43ms
loop_time 10.00ms   => 4.22ms
loop_time 11.00ms   => 4.59ms
loop_time 12.00ms   => 5.53ms
loop_time 13.00ms   => 5.82ms
loop_time 14.00ms   => 6.18ms
loop_time 15.00ms   => 6.32ms
loop_time 16.00ms   => 6.96ms
loop_time 17.00ms   => 8.00ms
loop_time 18.00ms   => 7.97ms
loop_time 19.00ms   => 8.28ms
loop_time 20.00ms   => 8.75ms
loop_time 21.00ms   => 9.70ms
loop_time 22.00ms   => 9.57ms
loop_time 23.00ms   => 10.48ms
loop_time 24.00ms   => 10.29ms
loop_time 25.00ms   => 10.31ms
loop_time 26.00ms   => 10.82ms
loop_time 27.00ms   => 10.84ms
loop_time 28.00ms   => 10.82ms
loop_time 29.00ms   => 10.91ms

I made a plot of those numbers (the sleep_time is max(0, loop_time - avg_duration)):

[Plot: average payload duration and sleep time over target loop time]

When the program does not sleep at all, the payload needs approximately 3.3ms (as the first three measurements show). As soon as the loop starts to sleep after the payload, the payload duration rises! In fact, it increases to roughly 10.5ms where it stays. Sleeping even longer does not increase the payload time.

Why? Why does the execution time of a piece of code depend on something I do afterwards (or before)? That does not make sense to me! It looks like the CPU says "I'm gonna sleep afterwards anyway, so let's take it slowly". I thought about caching effects, especially of the instruction cache, but loading instruction data from main memory does not take 7ms! Something else is going on here!

Is there a way to fix this? I.e. to make the payload execute as quickly as possible regardless of sleep time?

Asked Jul 07 '19 by Lukas Kalbertodt



1 Answer

I am quite sure that this is caused by CPU throttling: when the OS scheduler detects there is little work to do, it lowers the CPU frequency to save power.

When you do a lot of sleeps, you are telling the scheduler that you are not in such a hurry, and the CPU can take it easy.
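If you want to check this hypothesis directly, here is a minimal sketch (assuming a Linux box that exposes the cpufreq interface under /sys; current_cpu_freq_khz is just an illustrative helper name) that prints the current core frequency, which should visibly drop while the sleepy loop is running:

use std::fs;
use std::thread;
use std::time::Duration;

/// Read the current frequency of CPU 0 in kHz from the cpufreq sysfs
/// interface (Linux only; returns None if the file is not available).
fn current_cpu_freq_khz() -> Option<u64> {
    fs::read_to_string("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")
        .ok()?
        .trim()
        .parse()
        .ok()
}

fn main() {
    // Sample the frequency a few times; with an on-demand/powersave governor
    // the reported value drops when the machine is mostly idle.
    for _ in 0..10 {
        match current_cpu_freq_khz() {
            Some(khz) => println!("cpu0 frequency: {} MHz", khz / 1000),
            None => println!("cpufreq interface not available"),
        }
        thread::sleep(Duration::from_millis(500));
    }
}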

You can see that this is the case by running a CPU intensive task in another window, with low priority. For example, in Linux you can run:

$ nice bash -c 'while true ; do : ; done'

And at the same time, in another window run your program:

$ cargo run --release
loop_time 1.00ms    => 3.13ms
loop_time 2.00ms    => 3.17ms
loop_time 3.00ms    => 3.19ms
loop_time 4.00ms    => 3.13ms
loop_time 5.00ms    => 3.16ms
loop_time 6.00ms    => 3.22ms
loop_time 7.00ms    => 3.14ms
loop_time 8.00ms    => 3.15ms
loop_time 9.00ms    => 3.13ms
loop_time 10.00ms   => 3.18ms
loop_time 11.00ms   => 3.14ms
loop_time 12.00ms   => 3.17ms
loop_time 13.00ms   => 3.15ms
...
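The low-priority loop keeps at least one core fully loaded, so the frequency governor keeps the clock high, while nice ensures the scheduler still hands your benchmark all the CPU time it asks for; that is why the payload time stays flat at roughly 3.1-3.2 ms.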

Avoiding this depends on your OS. For example, on Linux you can fiddle with the /sys/devices/system/cpu/* options. I think that UPower provides some functions to manage it from a non-root application. It would be nice if there were a crate that managed this across systems, but I don't know of any.
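For illustration, here is a minimal sketch of the sysfs route (assuming a Linux machine with the cpufreq governor interface and root privileges to write to it; set_governor is just a name chosen here) that switches every core to the performance governor:

use std::fs;
use std::io;

/// Set the cpufreq scaling governor for all CPUs (requires root).
/// `governor` is typically "performance" or "powersave".
fn set_governor(governor: &str) -> io::Result<()> {
    for entry in fs::read_dir("/sys/devices/system/cpu")? {
        let path = entry?.path();
        let name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
        // Only the per-core directories cpu0, cpu1, ... carry a cpufreq subdirectory.
        if name.starts_with("cpu") && name[3..].chars().all(|c| c.is_ascii_digit()) {
            let gov_path = path.join("cpufreq/scaling_governor");
            if gov_path.exists() {
                fs::write(&gov_path, governor)?;
            }
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    set_governor("performance")
}

If you go this way, note the previous governor first so you can restore it when your program exits.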

An easy but hacky way to fix this, if you don't mind the wasted power, is just to run an idle thread with a busy loop.

std::thread::spawn(|| {
    use thread_priority::*; //external crate thread-priority
    let thread_id = thread_native_id();
    set_thread_priority(
        thread_id,
        ThreadPriority::Min,
        ThreadSchedulePolicy::Normal(NormalThreadSchedulePolicy::Idle),
    )
    .unwrap();
    loop {}
});
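The idea is the same as the nice busy loop above: the spinning thread keeps the frequency governor from downclocking the core, but because it runs at the lowest (idle) scheduling priority it only consumes cycles that nothing else wants.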

Naturally, if you just want to avoid throttling in this piece of code, you can do a busy wait:

    //if loop_time > elapsed {
    //    std::thread::sleep(loop_time - elapsed);
    //}
    // Busy-wait the remaining time, to avoid CPU throttling
    while loop_time > before.elapsed() {
        //you may want to try both with and without yield
        std::thread::yield_now();
    }
Answered Oct 23 '22 by rodrigo