I've been working on a hobby project for a while (written in C), and it's still far from complete. It's very important that it be fast, so I recently decided to do some benchmarking to verify that my approach to the problem isn't inefficient.
$ time ./old
real 1m55.92
user 0m54.29
sys 0m33.24
I redesigned parts of the program to remove many unnecessary operations and to reduce cache misses and branch mispredictions. The wonderful Callgrind tool was showing me more and more impressive numbers. Most of the benchmarking was done without forking external processes.
$ time ./old --dry-run
real 0m00.75
user 0m00.28
sys 0m00.24
$ time ./new --dry-run
real 0m00.15
user 0m00.12
sys 0m00.02
Clearly I was at least doing something right. Yet running the program for real told a different story.
$ time ./new
real 2m00.29
user 0m53.74
sys 0m36.22
As you might have noticed, the time is mostly dependent on the external processes. I don't know what caused the regression. There's nothing really weird about it; just a traditional vfork/execve/waitpid done by a single thread, running the same programs in the same order.
Something had to be causing forking to be slow, so I made a small test (similar to the one below) that would only spawn the new processes and have none of the overhead associated with my program. Obviously this had to be the fastest.
#define _GNU_SOURCE   /* exposes environ in <unistd.h> */
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, const char **argv)
{
    static const char *const _argv[] = {"/usr/bin/md5sum", "test.c", 0};

    /* Send the children's output to /dev/null. */
    int fd = open("/dev/null", O_WRONLY);
    if (fd < 0)
        return 1;
    dup2(fd, STDOUT_FILENO);
    close(fd);

    for (int i = 0; i < 100000; i++)
    {
        pid_t pid = vfork();
        int status;
        if (pid < 0)
            return 1;
        if (!pid)
        {
            /* Child: only exec or _exit is safe after vfork. */
            execve("/usr/bin/md5sum", (char*const*)_argv, environ);
            _exit(1);
        }
        waitpid(pid, &status, 0);
    }
    return 0;
}
$ time ./test
real 1m58.63
user 0m68.05
sys 0m30.96
I guess not.
At this point I decided to vote performance for governor, and the times got better:
$ for i in 0 1 2 3 4 5 6 7; do sudo sh -c "echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor";done
$ time ./test
real 1m03.44
user 0m29.30
sys 0m10.66
It seems like every new process gets scheduled on a separate core, and it takes a while for that core to switch to a higher frequency. I can't say why the old version ran faster. Maybe it was lucky. Maybe it (due to its inefficiency) caused the CPU to choose a higher frequency earlier.
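One way to check the "separate core" guess is to have each child report where it started. The following is only a rough sketch, assuming Linux with glibc; count_child_cpus is a hypothetical helper name, and it uses fork rather than vfork because the child calls sched_getcpu() before exiting.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork `n` short-lived children one after another and
 * count how many distinct CPUs they report starting on.
 * Returns -1 on error. Only CPUs 0..254 are distinguished, because the CPU
 * number is passed back through the 8-bit exit status. */
int count_child_cpus(int n)
{
    unsigned char seen[256] = {0};
    int distinct = 0;

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();  /* fork, not vfork: the child calls a non-exec function */
        if (pid < 0)
            return -1;
        if (pid == 0) {
            int cpu = sched_getcpu();
            _exit(cpu < 0 || cpu > 254 ? 255 : cpu);
        }

        int status;
        if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
            return -1;

        int cpu = WEXITSTATUS(status);
        if (cpu == 255)
            continue;        /* unknown or out-of-range CPU: not counted */
        if (!seen[cpu]) {
            seen[cpu] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

A result close to the number of cores would be consistent with the children being spread out; a result of 1 would suggest they stay put. It says nothing about clock frequency by itself.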
A nice side effect of changing governor was that compile times improved too. Apparently compiling requires forking many new processes. It's not a workable solution though, as this program will have to run on other people's desktops (and laptops).
The only way I found to improve the original times was to restrict the program (and child processes) to a single CPU by adding this code at the beginning:
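/* Needs #include <sched.h>, with _GNU_SOURCE defined before any include. */
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(0, &mask);                          /* allow CPU 0 only */
sched_setaffinity(0, sizeof(mask), &mask);  /* pid 0 = this process; children inherit the mask */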
Which actually was the fastest despite using the default "ondemand" governor:
$ time ./test
real 0m59.74
user 0m29.02
sys 0m10.67
Not only is it a hackish solution, but it doesn't work well in case the launched program uses multiple threads. There's no way for my program to know that.
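For reference, the pinning approach can be written as a self-contained helper with error checking. This is a minimal sketch, assuming Linux with glibc; pin_to_current_cpu is a hypothetical name, and pinning to the CPU the process is already on (instead of CPU 0) is my own variation, not what was benchmarked above.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical helper: restrict the calling process to the CPU it is
 * currently running on. Children created afterwards inherit the mask.
 * Returns the chosen CPU number, or -1 on failure. */
int pin_to_current_cpu(void)
{
    int cpu = sched_getcpu();
    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    return cpu;
}
```

Using the current CPU avoids assuming that CPU 0 is in the set the process is allowed to run on. It shares the drawback already mentioned: a multi-threaded child inherits the single-CPU mask.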
Does anyone have any idea how to get the spawned processes to run at a high CPU clock frequency? It has to be automated and must not require superuser privileges. Though I've only tested this on Linux so far, I intend to port this to more or less all popular and unpopular desktop OSes (and it will also run on servers). Any idea on any platform is welcome.
To adjust only a single CPU core, append -c core_number. The governor and the maximum and minimum frequencies can be set in /etc/default/cpupower.
In general, if you do not require CPU frequency scaling, then disable it so as not to impact system performance. Your systems may use significantly more energy when frequency scaling is disabled. The installer allows CPU frequency scaling to be enabled when the cpufreq scaling governor is set to performance.
CPUfreq — also referred to as CPU speed scaling — allows the clock speed of the processor to be adjusted on the fly. This enables the system to run at a reduced clock speed to save power.
The "scaling_governor" feature enables setting a static frequency for the CPU. The frequency value must be between scaling_min_freq and scaling_max_freq. When the CPU frequency governor is set to "powersave" mode, the CPU is set to the lowest static frequency (within the bounds of scaling_min_freq and scaling_max_freq).
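Reading these settings does not need root, so a program can at least inspect them. The following is a small sketch in C; read_first_line is a hypothetical helper, and the sysfs paths in the usage note are the usual Linux cpufreq locations, which may be absent (for example in some virtual machines).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read the first line of a small text file such as
 * /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor into buf,
 * without the trailing newline. Returns 0 on success, -1 on failure. */
int read_first_line(const char *path, char *buf, size_t size)
{
    if (size == 0)
        return -1;

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    char *ok = fgets(buf, (int)size, f);
    fclose(f);
    if (!ok)
        return -1;

    buf[strcspn(buf, "\n")] = '\0';  /* strip the newline, if any */
    return 0;
}
```

Calling it on .../cpu0/cpufreq/scaling_governor, scaling_min_freq and scaling_max_freq gives the current governor name and the frequency limits (reported in kHz). A return value of -1 simply means the file could not be read.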
CPU frequency is seen (by most OSes) as a system property. Thus, you can't change it without root rights. There is some research on extensions that would allow adapting it for specific programs; however, since the energy/performance model differs even within the same general architecture, you will hardly find a general solution.
In addition, be aware that in order to guarantee fairness, the Linux scheduler shares the execution time of parent and child processes for the first epoch of the child. This might have an impact on your problem.