Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Multi-threaded random_r is slower than single threaded version

The following program is essentially the same as the one described here. When I run and compile the program using two threads (NTHREADS == 2), I get the following run times:

real        0m14.120s
user        0m25.570s
sys         0m0.050s

When it is run with just one thread (NTHREADS == 1), I get run times significantly better even though it is only using one core.

real        0m4.705s
user        0m4.660s
sys         0m0.010s

My system is dual core, and I know random_r is thread safe and I am pretty sure it is non-blocking. When the same program is run without random_r and a calculation of cosines and sines is used as a replacement, the dual-threaded version runs in about 1/2 the time as expected.

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

#define NTHREADS 2
#define PRNG_BUFSZ 8
#define ITERATIONS 1000000000

void* thread_run(void* arg) {
    int r1, i, totalIterations = ITERATIONS / NTHREADS;
    for (i = 0; i < totalIterations; i++){
        random_r((struct random_data*)arg, &r1);
    }
    printf("%i\n", r1);
}

int main(int argc, char** argv) {
    struct random_data* rand_states = (struct random_data*)calloc(NTHREADS, sizeof(struct random_data));
    char* rand_statebufs = (char*)calloc(NTHREADS, PRNG_BUFSZ);
    pthread_t* thread_ids;
    int t = 0;
    thread_ids = (pthread_t*)calloc(NTHREADS, sizeof(pthread_t));
    /* create threads */
    for (t = 0; t < NTHREADS; t++) {
        initstate_r(random(), &rand_statebufs[t], PRNG_BUFSZ, &rand_states[t]);
        pthread_create(&thread_ids[t], NULL, &thread_run, &rand_states[t]);
    }
    for (t = 0; t < NTHREADS; t++) {
        pthread_join(thread_ids[t], NULL);
    }
    free(thread_ids);
    free(rand_states);
    free(rand_statebufs);
}

I am confused why when generating random numbers the two threaded version performs much worse than the single threaded version, considering random_r is meant to be used in multi-threaded applications.

like image 495
Nixuz Avatar asked Jun 08 '10 19:06

Nixuz


People also ask

Is multithreading faster than single thread?

Multithreading is always faster than serial. Dispatching a cpu heavy task into multiple threads won't speed up the execution. On the contrary it might degrade overall performance. Imagine it like this: if you have 10 tasks and each takes 10 seconds, serial execution will take 100 seconds in total.

Why is multithreading slower than single thread?

Every thread needs some overhead and system resources, so it also slows down performance. Another problem is the so called "thread explosion" when MORE thread are created than cores are on the system. And some waiting threads for the end of other threads is the worst idea for multi threading.

Why is multithreading slower than multiprocessing?

In Multiprocessing, CPU has to switch between multiple programs so that it looks like that multiple programs are running simultaneously. In multithreading, CPU has to switch between multiple threads to make it appear that all threads are running simultaneously. The creation of a process is slow and resource-specific.

Is a multi threaded approach always better than a single threaded approach?

They do not make the computer run faster. All they can do is increase the efficiency of the computer by using time that would otherwise be wasted. In certain types of processing this optimization can increase efficiency and decrease running time.


2 Answers

A very simple change to space the data out in memory:

struct random_data* rand_states = (struct random_data*)calloc(NTHREADS * 64, sizeof(struct random_data));
char* rand_statebufs = (char*)calloc(NTHREADS*64, PRNG_BUFSZ);
pthread_t* thread_ids;
int t = 0;
thread_ids = (pthread_t*)calloc(NTHREADS, sizeof(pthread_t));
/* create threads */
for (t = 0; t < NTHREADS; t++) {
    initstate_r(random(), &rand_statebufs[t*64], PRNG_BUFSZ, &rand_states[t*64]);
    pthread_create(&thread_ids[t], NULL, &thread_run, &rand_states[t*64]);
}

results in a much faster running time on my dual-core machine.

This would confirm the suspicion it was meant to test - that you are mutating values on the same cache line in two separate threads, and so have cache contention. Herb Sutter's 'machine architecture - what your programming language never told you' talk is worth watching if you've got the time if you don't know about that yet, he demonstrates false sharing starting at around 1:20.

Work out your cache line size, and create each thread's data so it is aligned to it.

It's a bit cleaner to plonk all the thread's data into a struct, and align that:

#define CACHE_LINE_SIZE 64

struct thread_data {
    struct random_data random_data;
    char statebuf[PRNG_BUFSZ];
    char padding[CACHE_LINE_SIZE - sizeof ( struct random_data )-PRNG_BUFSZ];
};

int main ( int argc, char** argv )
{
    printf ( "%zd\n", sizeof ( struct thread_data ) );

    void* apointer;

    if ( posix_memalign ( &apointer, sizeof ( struct thread_data ), NTHREADS * sizeof ( struct thread_data ) ) )
        exit ( 1 );

    struct thread_data* thread_states = apointer;

    memset ( apointer, 0, NTHREADS * sizeof ( struct thread_data ) );

    pthread_t* thread_ids;

    int t = 0;

    thread_ids = ( pthread_t* ) calloc ( NTHREADS, sizeof ( pthread_t ) );

    /* create threads */
    for ( t = 0; t < NTHREADS; t++ ) {
        initstate_r ( random(), thread_states[t].statebuf, PRNG_BUFSZ, &thread_states[t].random_data );
        pthread_create ( &thread_ids[t], NULL, &thread_run, &thread_states[t].random_data );
    }

    for ( t = 0; t < NTHREADS; t++ ) {
        pthread_join ( thread_ids[t], NULL );
    }

    free ( thread_ids );
    free ( thread_states );
}

with CACHE_LINE_SIZE 64:

refugio:$ gcc -O3 -o bin/nixuz_random_r src/nixuz_random_r.c -lpthread
refugio:$ time bin/nixuz_random_r 
64
63499495
944240966

real    0m1.278s
user    0m2.540s
sys 0m0.000s

Or you can use double the cache line size, and use malloc - the extra padding ensures the mutated memory is on separate lines, as malloc is 16 (IIRC) rather than 64 byte aligned.

(I reduced ITERATIONS by a factor of ten rather than having a stupidly fast machine)

like image 91
Pete Kirkham Avatar answered Oct 23 '22 01:10

Pete Kirkham


I don't know if this is relevant or not - but i just saw a very similar behavior (order of magnitude slower with 2 threads than with one) ... I basically changed a:

  srand(seed);
  foo = rand();

to a

  myseed = seed;
  foo = rand_r(&myseed);

and that "fixed" it (2 threads is now reliably almost twice as fast - e.g. 19s instead of 35s).

I don't know what the issue could have been -- locking or cache coherence on the internals of rand() maybe? Anyway, there is also a random_r() so maybe that would be of use to you (a year ago) or someone else.

like image 27
Jerry Avatar answered Oct 23 '22 01:10

Jerry