I am attempting to parallelise a calculation and consolidate the results into a matrix. A large number of calculations are performed, and each one contributes to a matrix that sums all of the results.
This is a reduction; many frameworks (such as Kokkos and CUDA) support reducing scalars, i.e. summing one number from each parallelised calculation. However, I want to reduce a matrix.
The resulting matrix scales with the problem size but always remains far smaller than the number of parallelised calculations. There are always multiple contributions to each matrix entry.
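For a scalar, the supported pattern looks roughly like this fragment (a sketch only; it assumes an already-initialized Kokkos context, and the 0.013 contribution stands in for a real i-dependent value):

double total = 0.0;
Kokkos::parallel_reduce(
    "scalar_reduce", Kokkos::RangePolicy<>(0, 10752),
    KOKKOS_LAMBDA(const int i, double &local) {
        local += 0.013; // per-iteration contribution; would depend on i
    },
    total); // total now holds the sum over all iterations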
My code is in C++, and I'm currently using the Kokkos framework for the parallelisation.
So far I have tried the following:
1. Giving each thread its own copy of the matrix, copying all of these from the device (GPU) to the host (CPU), and summing them in serial.
2. As above, but performing the serial summation on a single thread on the device (GPU), then copying only the summed matrix to the host.
3. Creating one device-side matrix with the Kokkos::Atomic memory trait and having each thread += its contribution into it, relying on the atomic accesses to prevent collisions. I then copy this matrix to the host.
Can anyone recommend a better scheme than the atomic matrix? Is there a C++ framework that standardises this kind of functionality?
Minimal example with atomic matrix:
#include <iostream>
#include <Kokkos_Core.hpp>

int main(int argc, char *argv[]) {
    Kokkos::initialize(argc, argv);
    { // scope so the View is destroyed before Kokkos::finalize()
        int matrix_size = 200;
        int batches = 10;
        Kokkos::View<double **, Kokkos::MemoryTraits<Kokkos::Atomic>> r(
            "result_matrix", matrix_size, matrix_size);
        for (int batch = 0; batch < batches; batch++) {
            Kokkos::parallel_for(
                "populate", Kokkos::RangePolicy<>(0, 10752),
                KOKKOS_LAMBDA(const int i) {
                    // calculation goes here;
                    // indices and values should be computed from i
                    r(42, 43) += 0.013;
                    r(42, 46) += 0.02;
                });
        }
        auto h_r = Kokkos::create_mirror_view(r);
        Kokkos::deep_copy(h_r, r);
        std::cout << batches * 10752 * 0.013 << " should equal " << h_r(42, 43) << std::endl;
        std::cout << batches * 10752 * 0.02 << " should equal " << h_r(42, 46) << std::endl;
    }
    Kokkos::finalize();
}
It looks like you have enough memory to set up a separate accumulator matrix for each CUDA core (about 700M values altogether).
Each of the 11M threads can use the index of the core it's running on to accumulate into the matrix assigned to that core without any synchronization. Then you can do a parallel merge of the ~15,000 partial matrices.
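A minimal sketch of that two-phase scheme, in the same Kokkos style as your example (num_buffers and total_work are placeholder names I'm introducing, and 15000 is just a stand-in for the actual number of resident cores):

const int num_buffers = 15000;      // roughly one accumulator per resident core
const int total_work  = 10 * 10752; // batches * iterations from the question
const int matrix_size = 200;

// Phase 1: one private accumulator matrix per worker, so no atomics and
// no collisions; each worker strides over its share of the work items.
Kokkos::View<double ***> partial("partials", num_buffers, matrix_size, matrix_size);
Kokkos::parallel_for(
    "accumulate", Kokkos::RangePolicy<>(0, num_buffers),
    KOKKOS_LAMBDA(const int b) {
        for (int i = b; i < total_work; i += num_buffers) {
            partial(b, 42, 43) += 0.013; // real indices/values would depend on i
            partial(b, 42, 46) += 0.02;
        }
    });

// Phase 2: parallel merge; one thread per matrix entry sums across buffers.
Kokkos::View<double **> result("result_matrix", matrix_size, matrix_size);
Kokkos::parallel_for(
    "merge", Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {matrix_size, matrix_size}),
    KOKKOS_LAMBDA(const int row, const int col) {
        double sum = 0.0;
        for (int b = 0; b < num_buffers; ++b)
            sum += partial(b, row, col);
        result(row, col) = sum;
    });

A nice side effect of Kokkos's default GPU layout is that the buffer index b is the stride-1 dimension, so workers updating the same entry of adjacent buffers touch adjacent memory. It's also worth noting that Kokkos ships Kokkos::Experimental::ScatterView (Kokkos_ScatterView.hpp), which packages this kind of duplicate/contribute/merge pattern and picks duplication or atomics per backend.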