Reduce matrix rows with CUDA

Question

Windows 7, NVidia GeForce 425M.

I wrote a simple CUDA code which calculates the row sums of a matrix. The matrix has uni-dimensional representation (pointer to a float).

The serial version of code is below (it has 2 loops, as expected):

void serial_rowSum (float* m, float* output, int nrow, int ncol) {
    float sum;
    for (int i = 0 ; i < nrow ; i++) {
        sum = 0;
        for (int j = 0 ; j < ncol ; j++)
            sum += m[i*ncol+j];
        output[i] = sum;
    }
}

Inside the CUDA code, I call the kernel function sweeping the matrix by rows. Below, the kernel call snippet:

dim3 threadsPerBlock((unsigned int) nThreadsPerBlock); // has to be multiple of 32
dim3 blocksPerGrid((unsigned int) ceil(nrow/(float) nThreadsPerBlock)); 

kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);

and the kernel function which performs the parallel sum of the rows (still has 1 loop):

__global__ void kernel_rowSum(float *m, float *s, int nrow, int ncol) {

    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;

    if (rowIdx < nrow) {
        float sum=0;
        for (int k = 0 ; k < ncol ; k++)
            sum+=m[rowIdx*ncol+k];
        s[rowIdx] = sum;            
    }

}

So far so good. The serial and parallel (CUDA) results are equal.

The whole point is that the CUDA version takes almost twice the time of the serial one to compute, even if I change the nThreadsPerBlock parameter: I tested nThreadsPerBlock from 32 to 1024 (maximum number of threads per block allowed for my card).

IMO, the matrix dimension is big enough to justify parallelization: 90,000 x 1,000.

Below, I report the time elapsed for the serial and parallel versions using different nThreadsPerBlock. Time reported in msec over an average of 100 samples:

Matrix: nrow = 90000 x ncol = 1000

Serial: Average Time Elapsed per Sample in msec (100 samples): 289.18.

CUDA (32 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 497.11.

CUDA (1024 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 699.66.

Just in case, the version with 32/1024 nThreadsPerBlock is the fastest/slowest one.

I understand that there is a kind of overhead when copying from Host to Device and the other way around, but maybe the slowness is because I am not implementing the fastest code.

Since I am far from being a CUDA expert:

Am I coding the fastest version for this task? How could I improve my code? Can I get rid of the loop in the kernel function?

Any thoughts appreciated.

EDIT 1

Although I describe a standard rowSum, I am interested in the AND/OR operation of rows which have (0;1} values, like rowAND/rowOR. That said, it doesn't allow me to exploit the cuBLAS multiply by 1's COL column vector trick, as suggested by some commentators.

EDIT 2

As suggest by users other users and here endorsed:

FORGET ABOUT TRYING TO WRITE YOUR OWN FUNCTIONS, use Thrust library instead and the magic comes.

kangshiyin · Accepted Answer

Since you mentioned you need general reduction algorithm other than sum only. I will try to give 3 approaches here. kernel approach may have the highest performance. thrust approach is easiest to implement. cuBLAS approach works only with sum and have good performance.

Kernel Approach

Here's a very good doc introducing how to optimize standard parallel reduction. Standard reduction can be divide into 2 stages.

Multiple thread blocks each reduces one part of the data;
One thread block reduces from result of stage 1 to the final 1 element.

For your multi-reduction (reduce rows of mat) problem, only stage 1 is enough. The idea is to reduce 1 row per thread block. For further considerations like multi-row per thread block or 1 row per multiple thread blocks, you can refer to the paper provided by @Novak. This may improve the performance more, especially for matrices with bad shape.

Thrust Approach

General multi-reduction can be done by thrust::reduction_by_key in a few minutes. You can find some discussions here Determining the least element and its position in each matrix column with CUDA Thrust.

However thrust::reduction_by_key does not assume each row has the same length, so you will get performance penalty. Another post How to normalize matrix columns in CUDA with max performance? gives profiling comparison between thrust::reduction_by_key and cuBLAS approach on sum of rows. It may give you a basic understanding about the performance.

cuBLAS Approach

Sum of rows/cols of a matrix A can be seen as a matrix-vector multiplication where the elements of the vector are all ones. it can be represented by the following matlab code.

y = A * ones(size(A,2),1);

where y is the sum of rows of A.

cuBLAS libary provides a high performance matrix-vector multiplication function cublas<t>gemv() for this operation.

Timing result shows that this routine is only 10~50% slower than simply read all the elements of A once, which can be seen as the theoretical upper limit of the performance for this operation.

Vitality · Answer

Reducing the rows of a matrix can be solved by using CUDA Thrust in three ways (they may not be the only ones, but addressing this point is out of scope). As also recognized by the same OP, using CUDA Thrust is preferable for such a kind of problem. Also, an approach using cuBLAS is possible.

APPROACH #1 - reduce_by_key

This is the approach suggested at this Thrust example page. It includes a variant using make_discard_iterator.

APPROACH #2 - transform

This is the approach suggested by Robert Crovella at CUDA Thrust: reduce_by_key on only some values in an array, based off values in a “key” array.

APPROACH #3 - inclusive_scan_by_key

This is the approach suggested by Eric at How to normalize matrix columns in CUDA with max performance?.

APPROACH #4 - cublas<t>gemv

It uses cuBLAS gemv to multiply the relevant matrix by a column of 1's.

THE FULL CODE

Here is the code condensing the two approaches. The Utilities.cu and Utilities.cuh files are mantained here and omitted here. The TimingGPU.cu and TimingGPU.cuh are maintained here and are omitted as well.

#include <cublas_v2.h>

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <thrust/sequence.h>

#include <stdio.h>
#include <iostream>

#include "Utilities.cuh"
#include "TimingGPU.cuh"

// --- Required for approach #2
__device__ float *vals;

/**************************************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX - NEEDED FOR APPROACH #1 */
/**************************************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {

    T Ncols; // --- Number of columns

    __host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}

    __host__ __device__ T operator()(T i) { return i / Ncols; }
};

/******************************************/
/* ROW_REDUCTION - NEEDED FOR APPROACH #2 */
/******************************************/
struct row_reduction {

    const int Ncols;    // --- Number of columns

    row_reduction(int _Ncols) : Ncols(_Ncols) {}

    __device__ float operator()(float& x, int& y ) {
        float temp = 0.f;
        for (int i = 0; i<Ncols; i++)
            temp += vals[i + (y*Ncols)];
        return temp;
    }
};

/**************************/
/* NEEDED FOR APPROACH #3 */
/**************************/
template<typename T>
struct MulC: public thrust::unary_function<T, T>
{
    T C;
    __host__ __device__ MulC(T c) : C(c) { }
    __host__ __device__ T operator()(T x) { return x * C; }
};

/********/
/* MAIN */
/********/
int main()
{
    const int Nrows = 5;     // --- Number of rows
    const int Ncols = 8;     // --- Number of columns

    // --- Random uniform integer distribution between 10 and 99
    thrust::default_random_engine rng;
    thrust::uniform_int_distribution<int> dist(10, 99);

    // --- Matrix allocation and initialization
    thrust::device_vector<float> d_matrix(Nrows * Ncols);
    for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist(rng);

    TimingGPU timerGPU;

    /***************/
    /* APPROACH #1 */
    /***************/
    timerGPU.StartCounter();
    // --- Allocate space for row sums and indices
    thrust::device_vector<float> d_row_sums(Nrows);
    thrust::device_vector<int> d_row_indices(Nrows);

    // --- Compute row sums by summing values with equal row indices
    //thrust::reduce_by_key(thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)),
    //                    thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
    //                    d_matrix.begin(),
    //                    d_row_indices.begin(),
    //                    d_row_sums.begin(),
    //                    thrust::equal_to<int>(),
    //                    thrust::plus<float>());

    thrust::reduce_by_key(
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
                d_matrix.begin(),
                thrust::make_discard_iterator(),
                d_row_sums.begin());

    printf("Timing for approach #1 = %f
", timerGPU.GetCounter());

    // --- Print result
    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums[i] << "
";
    }

    /***************/
    /* APPROACH #2 */
    /***************/
    timerGPU.StartCounter();
    thrust::device_vector<float> d_row_sums_2(Nrows, 0);
    float *s_vals = thrust::raw_pointer_cast(&d_matrix[0]);
    gpuErrchk(cudaMemcpyToSymbol(vals, &s_vals, sizeof(float *)));
    thrust::transform(d_row_sums_2.begin(), d_row_sums_2.end(), thrust::counting_iterator<int>(0),  d_row_sums_2.begin(), row_reduction(Ncols));

    printf("Timing for approach #2 = %f
", timerGPU.GetCounter());

    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums_2[i] << "
";
    }

    /***************/
    /* APPROACH #3 */
    /***************/

    timerGPU.StartCounter();
    thrust::device_vector<float> d_row_sums_3(Nrows, 0);
    thrust::device_vector<float> d_temp(Nrows * Ncols);
    thrust::inclusive_scan_by_key(
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
                d_matrix.begin(),
                d_temp.begin());
    thrust::copy(
                thrust::make_permutation_iterator(
                        d_temp.begin() + Ncols - 1,
                        thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))),
    thrust::make_permutation_iterator(
                        d_temp.begin() + Ncols - 1,
                        thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))) + Nrows,
                d_row_sums_3.begin());

    printf("Timing for approach #3 = %f
", timerGPU.GetCounter());

    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums_3[i] << "
";
    }

    /***************/
    /* APPROACH #4 */
    /***************/
    cublasHandle_t handle;

    timerGPU.StartCounter();
    cublasSafeCall(cublasCreate(&handle));

    thrust::device_vector<float> d_row_sums_4(Nrows);
    thrust::device_vector<float> d_ones(Ncols, 1.f);

    float alpha = 1.f;
    float beta  = 0.f;
    cublasSafeCall(cublasSgemv(handle, CUBLAS_OP_T, Ncols, Nrows, &alpha, thrust::raw_pointer_cast(d_matrix.data()), Ncols, 
                               thrust::raw_pointer_cast(d_ones.data()), 1, &beta, thrust::raw_pointer_cast(d_row_sums_4.data()), 1));

    printf("Timing for approach #4 = %f
", timerGPU.GetCounter());

    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums_4[i] << "
";
    }

    return 0;
}

TIMING RESULTS (tested on a Kepler K20c)

Matrix size       #1     #1-v2     #2     #3     #4     #4 (no plan)
100  x 100        0.63   1.00     0.10    0.18   139.4  0.098
1000 x 1000       1.25   1.12     3.25    1.04   101.3  0.12
5000 x 5000       8.38   15.3     16.05   13.8   111.3  1.14

 100 x 5000       1.25   1.52     2.92    1.75   101.2  0.40    

5000 x 100        1.35   1.99     0.37    1.74   139.2  0.14

It seems that approaches #1 and #3 outperform approach #2, except in the cases of small numbers of columns. The best approach, however, is approach #4, which is significantly more convenient than the others, provided that the time needed to create the plan can be amortized during the computation.

Reduce matrix rows with CUDA

Tags:

c

matrix

cuda

EDIT 1

EDIT 2

Marcelo Sardelich

2 Answers

Kernel Approach

Thrust Approach

cuBLAS Approach

kangshiyin

Vitality

Recent Activity

Donate For Us

Reduce matrix rows with CUDA

Tags:

c

matrix

cuda

EDIT 1

EDIT 2

Marcelo Sardelich

2 Answers

Kernel Approach

Thrust Approach

cuBLAS Approach

kangshiyin

Vitality

Related questions

Recent Activity

Donate For Us