Dealing with temporary matrices and private memory inside OpenCL kernels

I'm currently migrating a rather hairy matching pursuit algorithm (part of a bigger image-processing algorithm) to OpenCL.

The algorithm uses a few internal matrices and vectors for processing. Half of them are rather small (fewer than 10 columns), but the other half can get rather big depending on the input matrices (n * n, 2n * n, etc.).

The definitions of all of the internal matrices depend on the input matrices.

Given that the standard offers no way to allocate memory dynamically inside a kernel, I've approached the memory problem by carving chunks out of a global memory buffer and assigning one to each work-item as its private scratch space. I make sure during context setup that the chunks do not overlap, so that data consistency is assured at runtime.
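
Concretely, the scheme looks something like the sketch below; the kernel and buffer names are placeholders, and each work-item simply offsets into one big scratch buffer:

    // Sketch of the partitioning scheme described above; `scratch` and `n`
    // are placeholder names. The host allocates one buffer holding
    // get_global_size(0) * n * n floats, so the slices cannot overlap.
    __kernel void process(__global const float *input,
                          __global float       *scratch,
                          const int             n)
    {
        // This work-item's disjoint chunk, used as a temporary n x n matrix.
        __global float *tmp = scratch + (size_t)get_global_id(0) * n * n;

        for (int i = 0; i < n * n; ++i)
            tmp[i] = 0.0f;  // initialize the temporary matrix

        // ... run the matching pursuit steps using tmp[] ...
    }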

This approach doesn't feel right to me. It feels more like a hack.

Have any of you run into this kind of situation? What was your solution?

Paul Irofti


2 Answers

Segmenting a global memory buffer like this is fine, although this technique is more commonly used for output back to the host. Global memory access typically costs hundreds of instruction cycles, so I would suggest that you:

  1. Allocate the temporary data in __private or __local memory instead. Check CL_DEVICE_LOCAL_MEM_SIZE for the latter, which is typically 16KB-64KB (see the host-side sketch after this list). Bear in mind that __local memory on a multiprocessor is shared across the work-groups resident on it; if you use too much, even within the CL_DEVICE_LOCAL_MEM_SIZE limit, this will negatively affect occupancy on the multiprocessor and hence your throughput. The best way to observe this is through experimentation on your workload and device.

  2. If your temporary matrices are too large for __local memory, consider whether you can make each work-item smaller, so that it DOES fit and you avoid the considerable overhead of global memory.

  3. If there is some hard constraint on the minimum data footprint of each work-item, use __global memory as you describe. However, make sure that you:

    • Launch your kernel with plenty of work-groups so that, while some are busy waiting on global memory accesses, others can be scheduled on the multiprocessors ("latency hiding").
    • Coalesce global memory accesses as far as your vendor supports this. The NVIDIA OpenCL Best Practices Guide goes into some detail, and performance improvements of more than 100% are very achievable.
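
For point 1, a minimal host-side sketch might look like the following; the function and variable names are my own, and the kernel is assumed to take a __local float * as its second argument:

    #include <CL/cl.h>
    #include <stdio.h>

    /* Host-side sketch: query the __local memory limit and, if an
       n x n float matrix fits, pass a dynamically sized __local buffer
       to the kernel. `device`, `kernel` and `n` are assumed to exist. */
    static int setup_local_scratch(cl_device_id device, cl_kernel kernel, size_t n)
    {
        cl_ulong local_mem_size = 0;
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(local_mem_size), &local_mem_size, NULL);

        size_t needed = n * n * sizeof(float);
        if (needed > local_mem_size) {
            fprintf(stderr, "temporary matrix does not fit in __local memory\n");
            return -1;
        }

        /* Passing NULL with a nonzero size allocates `needed` bytes of
           __local memory for the corresponding kernel parameter. */
        clSetKernelArg(kernel, 1, needed, NULL);
        return 0;
    }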
James Beilby


Your approach seems ok.

You can have a look at NVIDIA's OpenCL Best Practices Guide. In Section 3.2.2, "Shared Memory", there is an example of matrix multiplication in which each work-group copies the required data from global memory into local memory.
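
A kernel along the lines of that example looks roughly like this sketch (the TILE size and the kernel and argument names are my own; it computes C = A * B for n x n matrices, assumes n is a multiple of TILE, and must be launched with TILE x TILE work-groups):

    #define TILE 16  /* work-group edge; n is assumed to be a multiple of TILE */

    __kernel void matmul_tiled(__global const float *A,
                               __global const float *B,
                               __global float       *C,
                               const int             n)
    {
        __local float Atile[TILE][TILE];
        __local float Btile[TILE][TILE];

        const int row = get_global_id(1);  /* row of C this item computes */
        const int col = get_global_id(0);  /* column of C this item computes */
        const int ly  = get_local_id(1);
        const int lx  = get_local_id(0);

        float acc = 0.0f;
        for (int t = 0; t < n / TILE; ++t) {
            /* Each work-item stages one element of A and one of B into
               local memory; neighbouring items read neighbouring
               addresses, so the global loads coalesce. */
            Atile[ly][lx] = A[row * n + (t * TILE + lx)];
            Btile[ly][lx] = B[(t * TILE + ly) * n + col];
            barrier(CLK_LOCAL_MEM_FENCE);

            for (int k = 0; k < TILE; ++k)
                acc += Atile[ly][k] * Btile[k][lx];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        C[row * n + col] = acc;
    }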

kakTuZ


