 

iOS Metal compute pipeline slower than CPU implementation for search task

I made a simple experiment by implementing a naive char search algorithm searching 1,000,000 rows of 50 characters each (a 50 million character map) on both CPU and GPU (using the iOS 8 Metal compute pipeline).

The CPU implementation uses a simple loop; the Metal implementation gives each kernel thread 1 row to process (source code below).

To my surprise, the Metal implementation is on average 2-3 times slower than the simple, linear CPU version (if I use 1 core) and 3-4 times slower if I employ 2 cores (each of them searching half of the database)! I experimented with different threads-per-group values (16, 32, 64, 128, 512), yet still get very similar results.

iPhone 6:

CPU 1 core:  approx 0.12 sec
CPU 2 cores: approx 0.075 sec
GPU:         approx 0.35 sec (release mode, validation disabled)

I can see the Metal shader spending more than 90% of its time accessing memory (see below).

What can be done to optimise it?

Any insights will be appreciated, as there are not many sources on the internet (besides the standard Apple programming guides) providing details on memory access internals & trade-offs specific to the Metal framework.

METAL IMPLEMENTATION DETAILS:

Host code gist: https://gist.github.com/lukaszmargielewski/0a3b16d4661dd7d7e00d

Kernel (shader) code: https://gist.github.com/lukaszmargielewski/6b64d06d2d106d110126
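
For reference, a simplified sketch of what such a row-per-thread kernel looks like (hypothetical names and signature; the actual code is in the gists above). Each thread scans one row of a row-major char table for the search phrase, with an early out on the first mismatch:

    #include <metal_stdlib>
    using namespace metal;

    // Sketch only: one thread per row, row-major charTable of rowCount * rowLength chars.
    kernel void searchRows(const device char  *charTable    [[ buffer(0) ]],
                           const device char  *phrase       [[ buffer(1) ]],
                           constant uint      &rowLength    [[ buffer(2) ]],
                           constant uint      &phraseLength [[ buffer(3) ]],
                           device atomic_uint *matchCount   [[ buffer(4) ]],
                           uint                row          [[ thread_position_in_grid ]])
    {
        // Each thread walks its own 50-char row from left to right.
        const device char *rowStart = charTable + row * rowLength;

        for (uint start = 0; start + phraseLength <= rowLength; start++) {
            bool match = true;
            for (uint i = 0; i < phraseLength; i++) {
                if (rowStart[start + i] != phrase[i]) { match = false; break; } // early out
            }
            if (match) {
                atomic_fetch_add_explicit(matchCount, 1, memory_order_relaxed);
                return;
            }
        }
    }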

GPU frame capture profiling results:

[profiler screenshot]

asked May 25 '15 by Lukasz


1 Answer

The GPU shader is also striding vertically through memory, whereas the CPU is moving horizontally. Consider the addresses actually touched more or less concurrently by each thread executing in lockstep in your shader as you read charTable. The GPU will probably run a good deal faster if your charTable matrix is transposed.
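As a rough illustration of that suggestion (hypothetical names, not the actual gist code): if charTable is stored transposed, so that character col of row row sits at index col * rowCount + row, then on any given loop iteration neighbouring threads (neighbouring rows) read neighbouring bytes, which is much friendlier to the GPU's memory system:

    // Sketch assuming a transposed (column-major) char table.
    kernel void searchRowsTransposed(const device char  *charTableT   [[ buffer(0) ]],
                                     const device char  *phrase       [[ buffer(1) ]],
                                     constant uint      &rowCount     [[ buffer(2) ]],
                                     constant uint      &rowLength    [[ buffer(3) ]],
                                     constant uint      &phraseLength [[ buffer(4) ]],
                                     device atomic_uint *matchCount   [[ buffer(5) ]],
                                     uint                row          [[ thread_position_in_grid ]])
    {
        for (uint start = 0; start + phraseLength <= rowLength; start++) {
            bool match = true;
            for (uint i = 0; i < phraseLength; i++) {
                // Adjacent threads hit adjacent addresses: (start + i) * rowCount + row.
                if (charTableT[(start + i) * rowCount + row] != phrase[i]) { match = false; break; }
            }
            if (match) {
                atomic_fetch_add_explicit(matchCount, 1, memory_order_relaxed);
                return;
            }
        }
    }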

Also, because this code executes in a SIMD fashion, each GPU thread will probably have to run the loop to the full search phrase length, whereas the CPU will get to take advantage of early outs. The GPU code might actually run a little faster if you remove the early outs and just keep the code simple. Much depends on the search phrase length and likelihood of a match.
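A minimal sketch of that second point (again hypothetical, not the original kernel): compare to the full phrase length and accumulate mismatches instead of breaking out of the loop, so every SIMD lane does the same amount of work:

    // Branch-free comparison helper: no early out, all lanes run the full loop.
    static bool matchAt(const device char *rowChars,
                        const device char *phrase,
                        uint phraseLength,
                        uint start)
    {
        uint mismatches = 0;
        for (uint i = 0; i < phraseLength; i++) {
            mismatches += (rowChars[start + i] != phrase[i]); // bool promotes to 0 or 1
        }
        return mismatches == 0;
    }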

answered Sep 28 '22 by Ian Ollmann