I am attempting to parallelise a calculation and consolidate the results into a matrix. A large number of calculations are performed, and each one contributes to a matrix that sums all of the results.
This is a reduction; many frameworks (such as Kokkos and CUDA) support reducing scalars, i.e. summing one number from each parallelised calculation. However, I want to reduce a matrix.
The resulting matrix scales with the problem size but always remains far smaller than the number of parallelised calculations. There are always multiple contributions to each matrix entry.
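For a scalar, the supported pattern looks roughly like this fragment (a sketch only; it assumes an already-initialized Kokkos context, and the 0.013 contribution stands in for a real i-dependent value):

double total = 0.0;
Kokkos::parallel_reduce(
    "scalar_reduce", Kokkos::RangePolicy<>(0, 10752),
    KOKKOS_LAMBDA(const int i, double &local) {
        local += 0.013; // per-iteration contribution; would depend on i
    },
    total); // total now holds the sum over all iterations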
My code is in C++, and I'm currently using the Kokkos framework for the parallelisation.
So far I have tried the following:
1. Giving each thread its own copy of the matrix, copying all of these from the device (GPU) to the host (CPU), and summing them in serial.
2. As above, but performing the serial summation on a single thread on the device (GPU), then copying only the summed matrix to the host.
3. Creating one device-side matrix with the Kokkos::Atomic memory trait and having each thread += its contribution into it, relying on the atomic accesses to prevent collisions. I then copy this matrix to the host.
Can anyone recommend a better scheme than the atomic matrix? Is there a C++ framework that standardises this kind of functionality?
Minimal example with atomic matrix:
#include <iostream>
#include <Kokkos_Core.hpp>

int main(int argc, char *argv[]) {
    Kokkos::initialize(argc, argv);
    { // scope so the View is destroyed before Kokkos::finalize()
        int matrix_size = 200;
        int batches = 10;
        Kokkos::View<double **, Kokkos::MemoryTraits<Kokkos::Atomic>> r(
            "result_matrix", matrix_size, matrix_size);
        for (int batch = 0; batch < batches; batch++) {
            Kokkos::parallel_for(
                "populate", Kokkos::RangePolicy<>(0, 10752),
                KOKKOS_LAMBDA(const int i) {
                    // calculation goes here;
                    // indices and values should be computed from i
                    r(42, 43) += 0.013;
                    r(42, 46) += 0.02;
                });
        }
        auto h_r = Kokkos::create_mirror_view(r);
        Kokkos::deep_copy(h_r, r);
        std::cout << batches * 10752 * 0.013 << " should equal " << h_r(42, 43) << std::endl;
        std::cout << batches * 10752 * 0.02 << " should equal " << h_r(42, 46) << std::endl;
    }
    Kokkos::finalize();
}
It looks like you have enough memory to set up a separate accumulator matrix for each CUDA core (about 700M values altogether).
Each of the 11M threads can use the index of the core it's running on to accumulate into the matrix assigned to that core without any synchronization. Then you can do a parallel merge of the ~15,000 partial matrices.
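A minimal sketch of that two-phase scheme, in the same Kokkos style as your example (num_buffers and total_work are placeholder names I'm introducing, and 15000 is just a stand-in for the actual number of resident cores):

const int num_buffers = 15000;      // roughly one accumulator per resident core
const int total_work  = 10 * 10752; // batches * iterations from the question
const int matrix_size = 200;

// Phase 1: one private accumulator matrix per worker, so no atomics and
// no collisions; each worker strides over its share of the work items.
Kokkos::View<double ***> partial("partials", num_buffers, matrix_size, matrix_size);
Kokkos::parallel_for(
    "accumulate", Kokkos::RangePolicy<>(0, num_buffers),
    KOKKOS_LAMBDA(const int b) {
        for (int i = b; i < total_work; i += num_buffers) {
            partial(b, 42, 43) += 0.013; // real indices/values would depend on i
            partial(b, 42, 46) += 0.02;
        }
    });

// Phase 2: parallel merge; one thread per matrix entry sums across buffers.
Kokkos::View<double **> result("result_matrix", matrix_size, matrix_size);
Kokkos::parallel_for(
    "merge", Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {matrix_size, matrix_size}),
    KOKKOS_LAMBDA(const int row, const int col) {
        double sum = 0.0;
        for (int b = 0; b < num_buffers; ++b)
            sum += partial(b, row, col);
        result(row, col) = sum;
    });

A nice side effect of Kokkos's default GPU layout is that the buffer index b is the stride-1 dimension, so workers updating the same entry of adjacent buffers touch adjacent memory. It's also worth noting that Kokkos ships Kokkos::Experimental::ScatterView (Kokkos_ScatterView.hpp), which packages this kind of duplicate/contribute/merge pattern and picks duplication or atomics per backend.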