Dealing with temporary matrices and private memory inside OpenCL kernels

I'm currently migrating a rather hairy matching pursuit algorithm (part of a bigger image-processing algorithm) to OpenCL.

The algorithm uses a few internal matrices and vectors for processing. Half of them are rather small (fewer than 10 columns), but the other half can get rather big depending on the input matrices (n * n, 2n * n, etc.).

The definitions of all of the internal matrices depend on the input matrices.

Given that the standard offers no way to allocate memory dynamically inside a kernel, I've approached the memory problem by carving chunks out of a global memory buffer and assigning one to each work-item as its private scratch space. I make sure during context setup that the chunks do not overlap, so that data consistency is assured at runtime.
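
Concretely, the scheme looks something like the sketch below; the kernel and buffer names are placeholders, and each work-item simply offsets into one big scratch buffer:

    // Sketch of the partitioning scheme described above; `scratch` and `n`
    // are placeholder names. The host allocates one buffer holding
    // get_global_size(0) * n * n floats, so the slices cannot overlap.
    __kernel void process(__global const float *input,
                          __global float       *scratch,
                          const int             n)
    {
        // This work-item's disjoint chunk, used as a temporary n x n matrix.
        __global float *tmp = scratch + (size_t)get_global_id(0) * n * n;

        for (int i = 0; i < n * n; ++i)
            tmp[i] = 0.0f;  // initialize the temporary matrix

        // ... run the matching pursuit steps using tmp[] ...
    }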

This approach doesn't feel right to me. It feels more like a hack.

Have any of you run into this kind of situation? What was your solution?

Paul Irofti


2 Answers

Segmenting a global memory buffer like this is fine, although this technique is more commonly used for output back to the host. Global memory access typically costs hundreds of instruction cycles, so I would suggest that you:

  1. Allocate the temporary data in __private or __local memory instead. Check CL_DEVICE_LOCAL_MEM_SIZE for the latter, which is typically 16KB-64KB (see the host-side sketch after this list). Bear in mind that __local memory on a multiprocessor is shared across the work-groups resident on it; if you use too much, even within the CL_DEVICE_LOCAL_MEM_SIZE limit, this will negatively affect occupancy on the multiprocessor and hence your throughput. The best way to observe this is through experimentation on your workload and device.

  2. If your temporary matrices are too large for __local memory, consider whether you can make each work-item smaller, so that it DOES fit and you avoid the considerable overhead of global memory.

  3. If there is some hard constraint on the minimum data footprint of each work-item, use __global memory as you describe. However, make sure that you:

    • Launch your kernel with plenty of work-groups so that, while some are busy waiting on global memory accesses, others can be scheduled on the multiprocessors ("latency hiding").
    • Coalesce global memory accesses as far as your vendor supports this. The NVIDIA OpenCL Best Practices Guide goes into some detail, and performance improvements of more than 100% are very achievable.
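
For point 1, a minimal host-side sketch might look like the following; the function and variable names are my own, and the kernel is assumed to take a __local float * as its second argument:

    #include <CL/cl.h>
    #include <stdio.h>

    /* Host-side sketch: query the __local memory limit and, if an
       n x n float matrix fits, pass a dynamically sized __local buffer
       to the kernel. `device`, `kernel` and `n` are assumed to exist. */
    static int setup_local_scratch(cl_device_id device, cl_kernel kernel, size_t n)
    {
        cl_ulong local_mem_size = 0;
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(local_mem_size), &local_mem_size, NULL);

        size_t needed = n * n * sizeof(float);
        if (needed > local_mem_size) {
            fprintf(stderr, "temporary matrix does not fit in __local memory\n");
            return -1;
        }

        /* Passing NULL with a nonzero size allocates `needed` bytes of
           __local memory for the corresponding kernel parameter. */
        clSetKernelArg(kernel, 1, needed, NULL);
        return 0;
    }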
James Beilby


Your approach seems ok.

You can have a look at NVIDIA's OpenCL Best Practices Guide. In Section 3.2.2, "Shared Memory", there is an example of matrix multiplication in which each work-group copies the required data from global memory into local memory.
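
A kernel along the lines of that example looks roughly like this sketch (the TILE size and the kernel and argument names are my own; it computes C = A * B for n x n matrices, assumes n is a multiple of TILE, and must be launched with TILE x TILE work-groups):

    #define TILE 16  /* work-group edge; n is assumed to be a multiple of TILE */

    __kernel void matmul_tiled(__global const float *A,
                               __global const float *B,
                               __global float       *C,
                               const int             n)
    {
        __local float Atile[TILE][TILE];
        __local float Btile[TILE][TILE];

        const int row = get_global_id(1);  /* row of C this item computes */
        const int col = get_global_id(0);  /* column of C this item computes */
        const int ly  = get_local_id(1);
        const int lx  = get_local_id(0);

        float acc = 0.0f;
        for (int t = 0; t < n / TILE; ++t) {
            /* Each work-item stages one element of A and one of B into
               local memory; neighbouring items read neighbouring
               addresses, so the global loads coalesce. */
            Atile[ly][lx] = A[row * n + (t * TILE + lx)];
            Btile[ly][lx] = B[(t * TILE + ly) * n + col];
            barrier(CLK_LOCAL_MEM_FENCE);

            for (int k = 0; k < TILE; ++k)
                acc += Atile[ly][k] * Btile[k][lx];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        C[row * n + col] = acc;
    }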

kakTuZ


