Why is a multithreaded C program forced to a single CPU on Mac OS X when system() is used in a thread?

Tags: c++, c, linux, macos, multithreading

I encountered a strange difference in the behavior of a program using pthreads between Linux and Mac OS X.

Consider the following program, which can be compiled with "gcc -pthread -o threadtest threadtest.c":

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static
void *worker(void *t)
{
    int i = *(int *)t;

    printf("Thread %d started\n", i);
    system("sleep 1");

    printf("Thread %d ends\n", i);
    return (void *) 0;
}

int main()
{
#define N_WORKERS   4

    pthread_t workers[N_WORKERS];
    int       args[N_WORKERS];
    int       i;

    for (i = 0; i < N_WORKERS; ++i)
    {
        args[i] = i;
        pthread_create(&workers[i], NULL, worker, args + i);
    }

    for (i = 0; i < N_WORKERS; ++i)
    {
        pthread_join(workers[i], NULL);
    }

    return 0;
}

Running the resulting executable on a 4-core Mac OS X machine results in the following behavior:

$ time ./threadtest
Thread 0 started
Thread 2 started
Thread 1 started
Thread 3 started
Thread 0 ends
Thread 1 ends
Thread 2 ends
Thread 3 ends

real    0m4.030s
user    0m0.006s
sys     0m0.008s

Note that the number of actual cores is probably not even relevant, as the time is simply spent in the "sleep 1" shell command without any computation. It is also apparent that the threads are started in parallel as the "Thread ... started" messages appear instantly after the program is started.

Running the same test program on a Linux machine gives the result that I expect:

$ time ./threadtest
Thread 0 started
Thread 3 started
Thread 1 started
Thread 2 started
Thread 1 ends
Thread 2 ends
Thread 0 ends
Thread 3 ends

real    0m1.010s
user    0m0.008s
sys     0m0.013s

Four processes are started in parallel that each sleep for a second, and that takes roughly a second.

If I put actual computations into the worker() function and remove the system() call, I see the expected speedup on Mac OS X as well.

So the question is: why does calling system() in a thread effectively serialize the execution of the threads on Mac OS X, and how can that be prevented?

asked Jul 01 '15 10:07

stm

1 Answer

@BasileStarynkevitch and @null pointed out that a global mutex in the system() implementation in the C library of Mac OS X might be responsible for the observed behavior. @null provided a reference to what appears to be the source file of the system() implementation, which contains these lock and unlock operations:

#if __DARWIN_UNIX03
    pthread_mutex_lock(&__systemfn_mutex);
#endif /* __DARWIN_UNIX03 */

#if __DARWIN_UNIX03
    pthread_mutex_unlock(&__systemfn_mutex);
#endif /* __DARWIN_UNIX03 */

By disassembling the system() function in lldb I verified that these calls are actually present in the compiled code.

The solution is to replace the system() C library function with a combination of the fork()/execve()/waitpid() system calls. Here is a quick proof of concept that modifies the worker() function from the original example (it additionally needs <unistd.h> and <sys/wait.h>):

static
void *worker(void *t)
{
    static const char shell[] = "/bin/sh";
    static const char * const args[] = { shell, "-c", "sleep 1", NULL };
    static const char * const env[] = { NULL };

    pid_t pid;
    int i = *(int *)t;

    printf("Thread %d started\n", i);

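    pid = fork();
    if (pid == 0)
    {
        execve(shell, (char **) args, (char **) env);
        _exit(127);  /* only reached if execve() fails */
    }
    else if (pid > 0)
    {
        waitpid(pid, NULL, 0);  /* wait only if fork() succeeded */
    }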

    printf("Thread %d ends\n", i);
    return (void *) 0;
}

With this modification the test program now executes in approximately one second on Mac OS X.
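For use beyond a proof of concept, the same fork()/execve()/waitpid() sequence could be wrapped in a small helper that also reports failures. The following is a sketch under that assumption. run_shell is a hypothetical name, and mapping every failure to -1 is a simplification rather than a replica of system()'s return value.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>
#include <unistd.h>

/*
 * run_shell is a hypothetical helper name, not part of any library.
 * Runs "/bin/sh -c command" in a child process and returns the child's
 * exit status, or -1 if the child could not be started, could not be
 * waited for, or was terminated by a signal.
 */
static int run_shell(const char *command)
{
    char *const argv[] = { "sh", "-c", (char *) command, NULL };
    char *const envp[] = { NULL };  /* empty environment, as in the proof of concept */
    int status;
    pid_t pid;

    pid = fork();
    if (pid < 0)
        return -1;                  /* fork() failed */

    if (pid == 0)
    {
        /* Child: only async-signal-safe calls between fork() and exec. */
        execve("/bin/sh", argv, envp);
        _exit(127);                 /* only reached if execve() fails */
    }

    /* Parent: retry waitpid() if it is interrupted by a signal. */
    while (waitpid(pid, &status, 0) < 0)
    {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In the worker() function, the fork()/execve()/waitpid() sequence would then become a single call such as run_shell("sleep 1"). Because the helper takes no process-wide lock, calls from different threads can wait for their children concurrently.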

answered Nov 12 '22 05:11

stm
