Little performance increasing when using multiple threads

Question

I was implementing multithread Jordan-Gauss method of solving a linear system and I saw that running on two threads took only about 15% less time than running on single thread instead of ideal 50%. So I wrote a simple program reproducing this. Here I create a matrix 2000x2000 and give 2000/THREADS_NUM lines to each thread to make some calculations with them.

#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
#include <time.h>

#ifndef THREADS_NUM
#define THREADS_NUM 1
#endif

#define MATRIX_SIZE 2000


typedef struct {
    double *a;
    int row_length;
    int rows_number;
} TWorkerParams;

void *worker_thread(void *params_v)
{
    TWorkerParams *params = (TWorkerParams *)params_v;
    int row_length = params->row_length;
    int i, j, k;
    int rows_number = params->rows_number;
    double *a = params->a;

    for(i = 0; i < row_length; ++i) // row_length is always the same
    {
        for(j = 0; j < rows_number; ++j) // rows_number is inverse proportional
                                         // to the number of threads
        {
            for(k = i; k < row_length; ++k) // row_length is always the same
            {
                a[j*row_length + k] -= 2.;
            }
        }
    }
    return NULL;
}


int main(int argc, char *argv[])
{
    // The matrix is of size NxN
    double *a =
        (double *)malloc(MATRIX_SIZE * MATRIX_SIZE * sizeof(double));
    TWorkerParams *params =
        (TWorkerParams *)malloc(THREADS_NUM * sizeof(TWorkerParams));
    pthread_t *workers = (pthread_t *)malloc(THREADS_NUM * sizeof(pthread_t));
    struct timespec start_time, end_time;
    int rows_per_worker = MATRIX_SIZE / THREADS_NUM;
    int i;
    if(!a || !params || !workers)
    {
        fprintf(stderr, "Error allocating memory
");
        return 1;
    }
    for(i = 0; i < MATRIX_SIZE*MATRIX_SIZE; ++i)
        a[i] = 4. * i; // just an example matrix
    // Initializtion of matrix is done, now initialize threads' params
    for(i = 0; i < THREADS_NUM; ++i)
    {
        params[i].a = a + i * rows_per_worker * MATRIX_SIZE;
        params[i].row_length = MATRIX_SIZE;
        params[i].rows_number = rows_per_worker;
    }
    // Get start time
    clock_gettime(CLOCK_MONOTONIC, &start_time);
    // Create threads
    for(i = 0; i < THREADS_NUM; ++i)
    {
        if(pthread_create(workers + i, NULL, worker_thread, params + i))
        {
            fprintf(stderr, "Error creating thread
");
            return 1;
        }
    }
    // Join threads
    for(i = 0; i < THREADS_NUM; ++i)
    {
        if(pthread_join(workers[i], NULL))
        {
            fprintf(stderr, "Error creating thread
");
            return 1;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &end_time);
    printf("Duration: %lf msec.
", (end_time.tv_sec - start_time.tv_sec)*1e3 +
            (end_time.tv_nsec - start_time.tv_nsec)*1e-6);
    return 0;
}

Here how I compile it:

gcc threads_test.c -o threads_test1 -lrt -pthread -DTHREADS_NUM=1 -Wall -Werror -Ofast
gcc threads_test.c -o threads_test2 -lrt -pthread -DTHREADS_NUM=2 -Wall -Werror -Ofast

Now when I run I get:

./threads_test1
Duration: 3695.359552 msec.
./threads_test2
Duration: 3211.236612 msec.

Which means 2-thread program runs 13% faster than single-thread, even though there is no synchronization between threads and they don't share any memory. I found this answer: https://stackoverflow.com/a/14812411/5647501 and thought that here may be some issues with processor cache, so I added padding, but still result remained the same. I changed my code as follows:

typedef struct {
    double *a;
    int row_length;
    int rows_number;
    volatile char padding[64 - 2*sizeof(int)-sizeof(double)];
} TWorkerParams;

#define VAR_SIZE (sizeof(int)*5 + sizeof(double)*2)
#define MEM_SIZE ((VAR_SIZE / 64 + 1) * 64  )
void *worker_thread(void *params_v)
{
    TWorkerParams *params = (TWorkerParams *)params_v;
    volatile char memory[MEM_SIZE];
    int *row_length  =      (int *)(memory + 0);
    int *i           =      (int *)(memory + sizeof(int)*1);
    int *j           =      (int *)(memory + sizeof(int)*2);
    int *k           =      (int *)(memory + sizeof(int)*3);
    int *rows_number =      (int *)(memory + sizeof(int)*4);
    double **a        = (double **)(memory + sizeof(int)*5);

    *row_length = params->row_length;
    *rows_number = params->rows_number;
    *a = params->a;

    for(*i = 0; *i < *row_length; ++*i) // row_length is always the same
    {
        for(*j = 0; *j < *rows_number; ++*j) // rows_number is inverse proportional
                                         // to the number of threads
        {
            for(*k = 0; *k < *row_length; ++*k) // row_length is always the same
            {
                (*a + *j * *row_length)[*k] -= 2. * *k;
            }
        }
    }
    return NULL;
}

So my question is: why do I get only 15% speed-up instead of 50% when using two threads here? Any help or suggestion will be appreciated. I am running 64-bit Ubuntu Linux, kernel 3.19.0-39-generic, CPU Intel Core i5 4200M (two physical cores with multithreading), but I also tested it on two other machines with the same result.

EDIT: If I replace a[j*row_length + k] -= 2.; with a[0] -= 2.;, I get expected speed-up:

./threads_test1
Duration: 1823.689481 msec.
./threads_test2
Duration: 949.745232 msec.

EDIT 2: Now, when I replaced it with a[k] -= 2.; I get the following:

./threads_test1
Duration: 1039.666979 msec.
./threads_test2
Duration: 1323.460080 msec.

This one I can't get at all.

EmDroid · Accepted Answer

This is a classic issue, switch the i and j for loops.

You are iterating through columns first and in the inner loop you process rows, that means you have much more cache misses than necessary.

My results with the original code (the first version without padding):

$ ./matrix_test1
Duration: 4620.799763 msec.
$ ./matrix_test2
Duration: 2800.486895 msec.

(better improvement than yours actually)

After switching the for loops for i and j:

$ ./matrix_test1
Duration: 1450.037651 msec.
$ ./matrix_test2
Duration: 728.690853 msec.

Here the 2-times speedup.

EDIT: In the fact the original is not that bad because the k index still goes through the row iterating columns, but is is still much better to iterate the row in the outer loop. And when the i rises, you are processing less and less items in the most inner loop, so it still matters.

EDIT2: (removed the block solution because it was actually producing different results) - but it still should be possible to utilize blocks to improve cache performance.

marcS · Answer

Do you speak about 13 % of speed up, but what is the time elapsed on your calculus fonction and not in the rest of programm.

You could start to estimate only the time passed on the calcul method without the time of thread management. It's possible that you lose an important part of your time in the thread managmement. That's could explain the small speed up that you obtained.

In other part, 50% of speed up with 2 threads it's quite impossible to obtain.

Little performance increasing when using multiple threads

Tags:

c++

c

linux

multithreading

pthreads

Matvey

2 Answers

EmDroid

marcS

Recent Activity

Donate For Us

Little performance increasing when using multiple threads

Tags:

c++

c

linux

multithreading

pthreads

Matvey

2 Answers

EmDroid

marcS

Related questions

Recent Activity

Donate For Us