Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filling a vector with multiple threads

I need to fill a huge (7734500 elements) std::vector<unsigned int> with random values, and I am trying to do it in parallel with multiple threads to achieve a higher efficiency. Here is the code that I have so far:

std::random_device rd; // seed generator

std::mt19937_64 generator{rd()}; // generator initialized with seed from rd

static const unsigned int NUM_THREADS = 4;


std::uniform_int_distribution<> initialize(unsigned long long int modulus)
{
    std::uniform_int_distribution<> unifDist{0, (int)(modulus-1)};
    return unifDist;
}


void unifRandVectorThreadRoutine
    (std::vector<unsigned int>& vector, unsigned int start,
    unsigned int end, std::uniform_int_distribution<>& dist)
{
    for(unsigned int i = start ; i < end ; ++i)
    {
        vector[i] = dist(generator);
    }
}


std::vector<unsigned int> uniformRandomVector
    (unsigned int rows, unsigned int columns, unsigned long long int modulus)
{
    std::uniform_int_distribution<> dist = initialize(modulus);

    std::thread threads[NUM_THREADS];

    std::vector<unsigned int> v;
    v.resize(rows*columns);

    // number of entries each thread will take care of
    unsigned int positionsEachThread = rows*columns/NUM_THREADS;

    // all but the last thread
    for(unsigned int i = 0 ; i < NUM_THREADS - 1 ; ++i)
    {
        threads[i] = std::thread(unifRandVectorThreadRoutine, v, i*positionsEachThread,
            (i+1)*positionsEachThread, dist);
        // threads[i].join();
    }

    // last thread
    threads[NUM_THREADS - 1] = std::thread(unifRandVectorThreadRoutine, v,
        (NUM_THREADS-1)*positionsEachThread, rows*columns, dist);
    // threads[NUM_THREADS - 1].join();

    for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
    {
        threads[i].join();
    }

    return v;
}

At the moment, it takes approximately 0.3 seconds: do you think there is a way to make it more efficient?


Edit: Giving each thread its own generator

I have modified the routine as follows

void unifRandVectorThreadRoutine
    (std::vector<unsigned int>& vector, unsigned int start,
    unsigned int end, std::uniform_int_distribution<>& dist)
{
    std::mt19937_64 generator{rd()};
    for(unsigned int i = start ; i < end ; ++i)
    {
        vector[i] = dist(generator);
    }
}

and the running time dropped by one half. So I am still sharing the std::random_device but each thread has its own std::mt19937_64.


Edit: Giving each thread its own vector and then concatenating

I changed the code as follows:

void unifRandVectorThreadRoutine
    (std::vector<unsigned int>& vector, unsigned int length,
    std::uniform_int_distribution<>& dist)
{
    vector.reserve(length);
    std::mt19937_64 generator{rd()};
    for(unsigned int i = 0 ; i < length ; ++i)
    {
        vector.push_back(dist(generator));
    }
}

and

std::vector<unsigned int> uniformRandomVector
    (unsigned int rows, unsigned int columns, unsigned long long int modulus)
{
    std::uniform_int_distribution<> dist = initialize(modulus);

    std::thread threads[NUM_THREADS];

    std::vector<unsigned int> v[NUM_THREADS];

    unsigned int positionsEachThread = rows*columns/NUM_THREADS;

    // all but the last thread
    for(unsigned int i = 0 ; i < NUM_THREADS - 1 ; ++i)
    {
        threads[i] = std::thread(unifRandVectorThreadRoutine, std::ref(v[i]), positionsEachThread, dist);
    }

    // last thread
    threads[NUM_THREADS - 1] = std::thread(unifRandVectorThreadRoutine, std::ref(v[NUM_THREADS-1]),
        rows*columns - (NUM_THREADS-1)*positionsEachThread, dist);

    for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
    {
        threads[i].join();
    }

    std::vector<unsigned int> finalVector;
    finalVector.reserve(rows*columns);

    for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
    {
        finalVector.insert(finalVector.end(), v[i].begin(), v[i].end());
    }

    return finalVector;
}

The execution time is slightly worse than before, when I was using just one vector shared between all the threads. Am I missing something or can it just happen?


Edit: using a different PRNG + benchmarks

Using a different PRNG (as suggested in some comments/answers) helps a lot: I tried with the xorshift+ and here is the implementation I am using:

class xorShift128PlusGenerator
{
public:
    xorShift128PlusGenerator()
    {
        state[0] = rd();
        state[1] = rd();
    };


    unsigned long int next()
    {
        unsigned long int x = state[0];
        unsigned long int const y = state[1];
        state[0] = y;
        x ^= x << 23; // a
        state[1] = x ^ y ^ (x >> 17) ^ (y >> 26); // b, c
        return state[1] + y;
    }


private:
    std::random_device rd; // seed generator
    unsigned long int state[2];

};

Then the routine is as follows

void unifRandVectorThreadRoutine
    (std::vector<unsigned int>& vector, unsigned int start,
    unsigned int end)
{
    xorShift128PlusGenerator prng;
    for(unsigned int i = start ; i < end ; ++i)
    {
        vector[i] = prng.next();
    }
}

Since I am now home and I am using a different (and more powerful) machine, I redid the tests to compare the results. Here is what I obtain:

  • Mersenne Twister with one generator per thread: 0.075 seconds
  • xorshift128+ shared between all threads: 0.023 seconds
  • xorshift128+ with one generator per thread: 0.023 seconds

Note: the execution time varies at each repetition. These are just typical values.

So there seems to be no difference if the xorshift generator is shared or not, but with all these improvements the execution time has dropped significantly.

like image 633
minomic Avatar asked Feb 22 '16 10:02

minomic


People also ask

Can multiple threads write to the same array?

The answer is no. Each array element has a region of memory reserved for it alone within the region attributed the overall array. Modifications of different elements therefore do not write to any of the same memory locations.

Can a vector have multiple data types?

A vector will hold an object of a single type, and only a single type.

Can multiple threads run concurrently?

Concurrency and Parallelism In the same multithreaded process in a shared-memory multiprocessor environment, each thread in the process can run concurrently on a separate processor, resulting in parallel execution, which is true simultaneous execution.

Can multiple threads read the same file?

Multiple threads can also read data from the same FITS file simultaneously, as long as the file was opened independently by each thread. This relies on the operating system to correctly deal with reading the same file by multiple processes.


2 Answers

The generator std::mt19937_64 generator{rd()}; is shared between threads. There will be some shared state that requires updating in it, hence contention; there is a data race. You should also look to allow each thread to use its own generator - you will just need to make sure that they generate separate sequences.

You possibly have a cache contention issue around std::vector<unsigned int> v;, it is declared outside the threads and then hit with each iteration of the for loop in each thread. Let each thread have its own vector to fill, once all the threads are done, collate their results in the vector v. Possibly via a std::future will be the quickest. The exact size of the contention depends on the cache line sizes and the size of the vector being used (and segmented).

In this case you fill a large number of elements (7734500) with a comparatively small number of threads (4), the ratio would possibly lead to fewer contentions.

W.r.t. the number threads you could be using, you should consider binding the NUM_THREADS to the hardware concurrency available on the target; i.e. std::thread::hardware_concurrency().

When dealing with this large a number of elements, you could also look to avoid unnecessary initialisations and moving of the results (albeit given the int type, the move is less not noticeable here). The container itself is also something to be aware of; vector requires contiguous memory, so any additional elements (during a coalition phase) could result in memory allocation and copying.

The speed of the random number generator may also have an impact, other implementations and/or algorithms may impact the final execution times significantly enough to be considered.

As always with all performance based questions - the final solution requires measurement. Implement the possible solutions, measure over the target processors and environments and adapt until a suitable performance is found.

like image 169
Niall Avatar answered Sep 24 '22 22:09

Niall


The Mersenne Twister generator (std::mt19937_64) is not too fast. You might consider other generators such as Xorshift+. See, e.g., this question: What is performance-wise the best way to generate random bools? (the discussion there goes beyond just bools).

And you should get rid of the data race in your code. Use one generator per thread.

like image 24
Daniel Langr Avatar answered Sep 24 '22 22:09

Daniel Langr