Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is multithreaded slower?

So I am trying to write a program that finds prime numbers. The real purpose of the project is just to learn about multithreading. First I wrote a single thread program and it finds up to 13,633,943 in 1 minute. My multithreaded version only got to 10,025,627.

Here is my code for the single threaded program

#include <iostream>

using namespace std;

bool isprime(long num)
{
    long lim = num/2;
    if(num == 1)
    {
        return 0;
    }
    for(long i = 2; i <= lim; i++)
    {
        if (num % i == 0)
        {
            return 0;
        }
        else{ lim = num/i; }
    }
    return 1;
}

int main()
{
    long lim;
    cout << "How many numbers should I test: ";
    cin >> lim;
    for(long i = 1; i <= lim || lim == 0; i++)
    {
        if(isprime(i))
        {
            cout << i << endl;
        }
    }
}

Here is my code for my multithreaded program.

extern"C"
{
    #include <pthread.h>
    #include <unistd.h>
}
#include <iostream>

using namespace std;

bool isprime(long num);
void * iter1(void * arg);
void * iter2(void * arg);
void * iter3(void * arg);
void * iter4(void * arg);


int main()
{
    //long lim;
    //cout << "How many numbers should I test: ";
    //cin >> lim;
    pthread_t t1;
    char mem1[4096];//To avoid false sharing. Needed anywhere else?
    pthread_t t2;
    char mem2[4096];//These helped but did not solve problem.
    pthread_t t3;
    pthread_create(&t1, NULL, iter1, NULL);
    pthread_create(&t2, NULL, iter2, NULL);
    pthread_create(&t3, NULL, iter3, NULL);
    iter4(0);
}

bool isprime(long num)
{
    long lim = num/2;
    if(num == 1)
    {
        return 0;
    }
    for(long i = 2; i <= lim; i++)
    {
        if (num % i == 0)
        {
            return 0;
        }
        else{ lim = num/i; }
    }
    return 1;
}

void * iter1(void * arg)
{
    for(long i = 1;; i = i + 4)
    {
        if(isprime(i))
        {
            cout << i << endl;
        }
    }
return 0;
}

void * iter2(void * arg)
{
    for(long i = 2;; i = i + 4)
    {
        if(isprime(i))
        {
            cout << i << endl;
        }
    }
return 0;
}

void * iter3(void * arg)
{
    for(long i = 3;; i = i + 4)
    {
        if(isprime(i))
        {
            cout << i << endl;
        }
    }
return 0;
}

void * iter4(void * arg)
{
    for(long i = 4;; i = i + 4)
    {
        if(isprime(i))
        {
            cout << i << endl;
        }
    }
return 0;
}

Something that especially confuses me is that system monitor reports 25% CPU usage for the single thread and 100% usage for the multithread. Shouldn't that mean it is doing 4 times as many calculation?

like image 361
Dmobb Jr. Avatar asked Jun 06 '13 14:06

Dmobb Jr.


2 Answers

I'm fairly sure cout acts a shared resource - and even if it actually prints each number correctly and in the right order, it slows things down VERY much to do so.

I have done something similar (it is more flexible, and uses an atomic operation to "pick the next number"), and it's almost exactly 4x faster on my quad core machine. But that's only if I don't print anything. If it prints to the console, it's a lot slower - because a lot of the time is used shuffling pixels rather than actually calculating.

Comment out the cout << i << endl; line, and it will run much quicker.

Edit: using my test program, with printing:

Single thread: 15.04s. 
Four threads: 11.25s

Without printing:

Single threads: 12.63s.
Four threads: 3.69s.

3.69 * 4 = 14.76s, but the time command on my Linux machine shows 12.792s total runtime, so there is obviously a little bit of time when all threads aren't running - or some accounting errors...

like image 199
Mats Petersson Avatar answered Sep 30 '22 11:09

Mats Petersson


I think a lot of your current problem is that you're taking the part that can really operate multi-threaded (finding the primes) and burying it in noise (the time to write the output to the console).

To get an idea of how much effect this has, I rewrote your main a little bit to separate printing the primes from finding the primes. To make timing easier, I also had it take the limit from the command line instead of interactively, giving this:

int main(int argc, char **argv) {
    if (argc != 2) {
        std::cerr << "Usage: bad_prime <limit:long>\n";
        return 1;
    }
    std::vector<unsigned long> primes;

    unsigned long lim = atol(argv[1]);

    clock_t start = clock();

    for(unsigned long i = 1; i <= lim; i++)
        if(isprime(i))
            primes.push_back(i);
    clock_t stop = clock();

    for (auto a : primes)
        std::cout << a << "\t";

    std::err << "\nTime to find primes: " << double(stop-start)/CLOCKS_PER_SEC << "\n";
}

Skipping the thousands of lines of the primes themselves, I get a result like this:

Time to find primes: 0.588


Real    48.206
User    1.68481
Sys     3.40082

So -- roughly half a second to find the primes, and over 47 seconds to print them. Assuming the intent really is to write the output to the console, we might as well stop right there. Even if multithreading could completely eliminate the time to find the primes, we'd still only change the ultimate time from ~48.2 seconds to ~47.6 seconds -- unlikely to be worthwhile.

For the moment, therefore, I'll assume the real intent is to write the output to something like a file. Since it seems pretty pointless to go to the work of making code multi-threaded, but run horribly inefficient code in each thread, I thought I'd optimize (or, at least, de-pessimize) the single-threaded code as a starting point.

First, I removed the endl and replaced it with "\n". With the output directed to a file, this reduced the run-time from 0.968 seconds to 0.678 seconds -- endl flushes the buffer in addition to writing a newline, and that buffer flushing accounted for roughly one third of the time taken by program overall.

On the same basis, I took the liberty of rewriting your isprime to something that's at least a little less inefficient:

bool isprime(unsigned long num) {
    if (num == 2)
        return true;

    if(num == 1 || num % 2 == 0)
        return false;

    unsigned long lim = sqrt(num);

    for(unsigned long i = 3; i <= lim; i+=2)
        if (num % i == 0)
            return false;

    return true;
}

This is certainly open to more improvement (e.g., sieve of Eratosthenes), but it's simple, straightforward, and around two to three times as fast (the times above are based on using this isprime, not yours).

At this point, multithreading the prime finding at least stands a chance of making some sense: with the prime finding taking roughly .5 out of .6 seconds, even if we can only double the speed, we should see a significant difference in overall time.

Separating the output from the prime finding also gives us a much better basis for writing a multi-threaded version of the code. With each thread writing its results to a separate vector, we can get meaningful (not interleaved) output without having to do locking on cout and such -- we compute each chunk separately, then print out each vector in order.

Code for that could look something like this:

#include <iostream>
#include <vector>
#include <time.h>
#include <math.h>
#include <thread>

using namespace std;

bool isprime(unsigned long num) {
    // same as above
}

typedef unsigned long UL;

struct params { 
    unsigned long lower_lim;
    unsigned long upper_lim;
    std::vector<unsigned long> results;

    params(UL l, UL u) : lower_lim(l), upper_lim(u) {}
};

long thread_func(params *p) { 
    for (unsigned long i=p->lower_lim; i<p->upper_lim; i++)
        if (isprime(i))
            p->results.push_back(i);
    return 0;
}

int main(int argc, char **argv) {
    if (argc != 2) {
        std::cerr << "Usage: bad_prime <limit:long>\n";
        return 1;
    }

    unsigned long lim = atol(argv[1]);

    params p[] = {
        params(1, lim/4),
        params(lim/4, lim/2),
        params(lim/2, 3*lim/4),
        params(3*lim/4, lim)
    };

    std::thread threads[] = {
        std::thread(thread_func, p), 
        std::thread(thread_func, p+1),
        std::thread(thread_func, p+2),
        std::thread(thread_func, p+3)
    };

    for (int i=0; i<4; i++) {
        threads[i].join();
        for (UL p : p[i].results)
            std::cout << p << "\n";
    }
}

Running this on the same machine as before (a fairly old dual-core processor), I get:

Real    0.35
User    0.639604
Sys     0

This seems to be scaling extremely well. If all we gained from was multi-core computation, we'd expect to see the time to find the primes divide by 2 (I'm running this on a dual-core processor) and the time to write the data to disk remain constant (multithreading isn't going to speed up my hard drive). Based on that, perfect scaling should give us 0.59/2 + 0.1 = 0.40 seconds.

The (admittedly) minor improvement we're seeing beyond that is mostly likely stemming from the fact that we can start writing the data from thread 1 to the disk while threads 2, 3 and 4 are still finding primes (and likewise, start writing the data from thread 2 while 3 and 4 are still computing, and writing data from thread 3 while thread 4 is still computing).

I suppose I should add that the improvement we're seeing is small enough that it could also be simple noise in the timing. I did, however, run both the single- and multi-threaded versions a number of times, and while there's some variation in both, the multi-threaded version is consistently faster than just the improvement in computation speed should account for.

I almost forgot: to get an idea of how much difference this makes in overall speed, I ran a test to see how long it would take to find the primes up to 13,633,943, which your original version found in one minute. Even though I'm almost certainly using a slower CPU (a ~7 year-old Athlon 64 X2 5200+) this version of the code does that in 12.7 seconds.

One final note: at least for the moment, I've left out the padding you'd inserted to prevent false sharing. Based on the times I'm getting, they don't seem to be necessary (or useful).

like image 36
Jerry Coffin Avatar answered Sep 30 '22 09:09

Jerry Coffin