I need to fill a huge (7734500 elements) std::vector<unsigned int>
with random values, and I am trying to do it in parallel with multiple threads to achieve a higher efficiency. Here is the code that I have so far:
std::random_device rd; // seed generator
std::mt19937_64 generator{rd()}; // generator initialized with seed from rd
static const unsigned int NUM_THREADS = 4;
std::uniform_int_distribution<> initialize(unsigned long long int modulus)
{
std::uniform_int_distribution<> unifDist{0, (int)(modulus-1)};
return unifDist;
}
void unifRandVectorThreadRoutine
(std::vector<unsigned int>& vector, unsigned int start,
unsigned int end, std::uniform_int_distribution<>& dist)
{
for(unsigned int i = start ; i < end ; ++i)
{
vector[i] = dist(generator);
}
}
std::vector<unsigned int> uniformRandomVector
(unsigned int rows, unsigned int columns, unsigned long long int modulus)
{
std::uniform_int_distribution<> dist = initialize(modulus);
std::thread threads[NUM_THREADS];
std::vector<unsigned int> v;
v.resize(rows*columns);
// number of entries each thread will take care of
unsigned int positionsEachThread = rows*columns/NUM_THREADS;
// all but the last thread
for(unsigned int i = 0 ; i < NUM_THREADS - 1 ; ++i)
{
threads[i] = std::thread(unifRandVectorThreadRoutine, v, i*positionsEachThread,
(i+1)*positionsEachThread, dist);
// threads[i].join();
}
// last thread
threads[NUM_THREADS - 1] = std::thread(unifRandVectorThreadRoutine, v,
(NUM_THREADS-1)*positionsEachThread, rows*columns, dist);
// threads[NUM_THREADS - 1].join();
for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
{
threads[i].join();
}
return v;
}
At the moment, it takes approximately 0.3 seconds: do you think there is a way to make it more efficient?
Edit: Giving each thread its own generator
I have modified the routine as follows
void unifRandVectorThreadRoutine
(std::vector<unsigned int>& vector, unsigned int start,
unsigned int end, std::uniform_int_distribution<>& dist)
{
std::mt19937_64 generator{rd()};
for(unsigned int i = start ; i < end ; ++i)
{
vector[i] = dist(generator);
}
}
and the running time dropped by one half. So I am still sharing the std::random_device
but each thread has its own std::mt19937_64
.
Edit: Giving each thread its own vector and then concatenating
I changed the code as follows:
void unifRandVectorThreadRoutine
(std::vector<unsigned int>& vector, unsigned int length,
std::uniform_int_distribution<>& dist)
{
vector.reserve(length);
std::mt19937_64 generator{rd()};
for(unsigned int i = 0 ; i < length ; ++i)
{
vector.push_back(dist(generator));
}
}
and
std::vector<unsigned int> uniformRandomVector
(unsigned int rows, unsigned int columns, unsigned long long int modulus)
{
std::uniform_int_distribution<> dist = initialize(modulus);
std::thread threads[NUM_THREADS];
std::vector<unsigned int> v[NUM_THREADS];
unsigned int positionsEachThread = rows*columns/NUM_THREADS;
// all but the last thread
for(unsigned int i = 0 ; i < NUM_THREADS - 1 ; ++i)
{
threads[i] = std::thread(unifRandVectorThreadRoutine, std::ref(v[i]), positionsEachThread, dist);
}
// last thread
threads[NUM_THREADS - 1] = std::thread(unifRandVectorThreadRoutine, std::ref(v[NUM_THREADS-1]),
rows*columns - (NUM_THREADS-1)*positionsEachThread, dist);
for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
{
threads[i].join();
}
std::vector<unsigned int> finalVector;
finalVector.reserve(rows*columns);
for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
{
finalVector.insert(finalVector.end(), v[i].begin(), v[i].end());
}
return finalVector;
}
The execution time is slightly worse than before, when I was using just one vector shared between all the threads. Am I missing something or can it just happen?
Edit: using a different PRNG + benchmarks
Using a different PRNG (as suggested in some comments/answers) helps a lot: I tried with the xorshift+
and here is the implementation I am using:
class xorShift128PlusGenerator
{
public:
xorShift128PlusGenerator()
{
state[0] = rd();
state[1] = rd();
};
unsigned long int next()
{
unsigned long int x = state[0];
unsigned long int const y = state[1];
state[0] = y;
x ^= x << 23; // a
state[1] = x ^ y ^ (x >> 17) ^ (y >> 26); // b, c
return state[1] + y;
}
private:
std::random_device rd; // seed generator
unsigned long int state[2];
};
Then the routine is as follows
void unifRandVectorThreadRoutine
(std::vector<unsigned int>& vector, unsigned int start,
unsigned int end)
{
xorShift128PlusGenerator prng;
for(unsigned int i = start ; i < end ; ++i)
{
vector[i] = prng.next();
}
}
Since I am now home and I am using a different (and more powerful) machine, I redid the tests to compare the results. Here is what I obtain:
Note: the execution time varies at each repetition. These are just typical values.
So there seems to be no difference if the xorshift generator is shared or not, but with all these improvements the execution time has dropped significantly.
The answer is no. Each array element has a region of memory reserved for it alone within the region attributed the overall array. Modifications of different elements therefore do not write to any of the same memory locations.
A vector will hold an object of a single type, and only a single type.
Concurrency and Parallelism In the same multithreaded process in a shared-memory multiprocessor environment, each thread in the process can run concurrently on a separate processor, resulting in parallel execution, which is true simultaneous execution.
Multiple threads can also read data from the same FITS file simultaneously, as long as the file was opened independently by each thread. This relies on the operating system to correctly deal with reading the same file by multiple processes.
The generator std::mt19937_64 generator{rd()};
is shared between threads. There will be some shared state that requires updating in it, hence contention; there is a data race. You should also look to allow each thread to use its own generator - you will just need to make sure that they generate separate sequences.
You possibly have a cache contention issue around std::vector<unsigned int> v;
, it is declared outside the threads and then hit with each iteration of the for loop in each thread. Let each thread have its own vector to fill, once all the threads are done, collate their results in the vector v
. Possibly via a std::future
will be the quickest. The exact size of the contention depends on the cache line sizes and the size of the vector being used (and segmented).
In this case you fill a large number of elements (7734500) with a comparatively small number of threads (4), the ratio would possibly lead to fewer contentions.
W.r.t. the number threads you could be using, you should consider binding the NUM_THREADS
to the hardware concurrency available on the target; i.e. std::thread::hardware_concurrency()
.
When dealing with this large a number of elements, you could also look to avoid unnecessary initialisations and moving of the results (albeit given the int
type, the move is less not noticeable here). The container itself is also something to be aware of; vector
requires contiguous memory, so any additional elements (during a coalition phase) could result in memory allocation and copying.
The speed of the random number generator may also have an impact, other implementations and/or algorithms may impact the final execution times significantly enough to be considered.
As always with all performance based questions - the final solution requires measurement. Implement the possible solutions, measure over the target processors and environments and adapt until a suitable performance is found.
The Mersenne Twister generator (std::mt19937_64
) is not too fast. You might consider other generators such as Xorshift+. See, e.g., this question: What is performance-wise the best way to generate random bools? (the discussion there goes beyond just bools).
And you should get rid of the data race in your code. Use one generator per thread.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With