Update: I'm spawning multiple processes now and it works fairly well, though the basic threading problem still exists.
I'm trying to thread a C++ (g++ 4.6.1) program that compiles a bunch of OpenCL kernels. Most of the run time is spent inside clBuildProgram. (It's genetic programming, and actually running the code and evaluating fitness is much, much faster.) I'm trying to thread the compilation of these kernels and not having any luck so far. At this point there's no shared data between threads (aside from the same platform and device reference), but only one thread runs at a time. I can run this code as several processes (just launching them in different terminal windows in Linux) and it will then use multiple cores, but not within one process. The same basic threading code (std::thread) does use multiple cores when the threads just do basic math, so I think the problem is either the OpenCL compile or some static data I forgot about. :) Any ideas? I've done my best to make this thread-safe, so I'm stumped.
I'm using AMD's SDK (OpenCL 1.1, circa 6/13/2010) and a 5830 or 5850 to run it. The SDK and g++ are not as up to date as they could be. The last time I installed a newer Linux distro to get a newer g++, my code ran at half speed (at least the OpenCL compiles did), so I went back. (I just checked the code on that install: it still runs at half speed, with no difference in threading behavior.) To be precise about "one thread at a time": it launches all of the threads and then alternates between two until they finish, then does the next two, and so on. It does look like all of the threads are running until the code gets to building the program. I'm not using a callback function in clBuildProgram. I realize there's a lot that could be going wrong here and it's hard to say without the code. :)
I am pretty sure the problem occurs inside clBuildProgram or in the call to it. I print the time taken there, and the threads that get postponed report a long compile time for their first compile. The only data shared between these clBuildProgram calls is the device id, in that each thread's cl_device_id has the same value.
This is how I'm launching the threads:
for (a = 0; a < num_threads; a++) {
    threads[a] = std::thread(std::ref(programs[a]));
    threads[a].detach();
    sleep(1); // giving the opencl init f()s time to complete
}
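A minimal sketch of one way to check whether the workers really overlap: keep the threads joinable instead of detaching them, and record each worker's wall-clock time. Everything named here is hypothetical and made up for illustration: `busy_work` is a CPU-only stand-in for the real per-thread job (it is not OpenCL code), and `run_timed_workers` is just a helper for this example. It assumes a C++11 standard library that provides `std::chrono::steady_clock`. On a machine with at least as many cores as threads, per-worker times that stay roughly flat as the thread count grows suggest the work runs in parallel; times that grow with the thread count suggest something is serializing it.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real per-thread job (e.g. the kernel build).
// It only burns CPU time; it is not OpenCL code.
double busy_work(unsigned long iterations) {
    volatile double acc = 0.0;
    for (unsigned long i = 0; i < iterations; ++i) {
        acc = acc + static_cast<double>(i % 7) * 0.5;
    }
    return acc;
}

// Runs `num_threads` workers, waits for all of them with join(), and
// returns the wall-clock seconds each worker spent inside `job`.
std::vector<double> run_timed_workers(std::size_t num_threads,
                                      const std::function<void()>& job) {
    std::vector<double> seconds(num_threads, 0.0);
    std::vector<std::thread> threads;
    threads.reserve(num_threads);

    for (std::size_t a = 0; a < num_threads; ++a) {
        // Each worker writes only to its own slot, so no locking is needed.
        threads.emplace_back([&seconds, &job, a] {
            auto start = std::chrono::steady_clock::now();
            job();
            auto stop = std::chrono::steady_clock::now();
            seconds[a] = std::chrono::duration<double>(stop - start).count();
        });
    }
    for (auto& t : threads) {
        t.join();  // wait for completion instead of detach() + sleep()
    }
    return seconds;
}
```

Calling something like `run_timed_workers(4, [] { busy_work(200000000UL); })` and comparing the result against a run with one thread gives a baseline for the "basic math" case; swapping the stand-in for the real build call would then show whether the build step behaves differently.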
This is where it's bogging down (and these are all local variables being passed, though the device id will be the same):
clBuildProgram(program, 1, &device, options, NULL, NULL);
It doesn't seem to make a difference whether each thread has a unique context or command_queue. I strongly suspected that was the problem, which is why I mention it. :)
Update: Spawning child processes with fork() will work for this.
You might want to post something on AMD's support forum about that. Considering how many OpenGL implementations have failed to provide the thread consistency the spec requires, it would not surprise me if OpenCL drivers were still suboptimal in that respect. They might use the process ID internally to separate data instead; who knows.
If you have a working multi-process generation, then I suggest you keep that and communicate results using IPC. You could use Boost.Interprocess (boost::interprocess), which has interesting ways of using serialization (e.g. with boost::spirit to reflect the data structures). Or you could use POSIX pipes, or shared memory, or just dump compilation results to files and poll the directory from your parent process, using boost::filesystem and directory iterators.
Forked processes may inherit some handles, so I believe there are ways to use unnamed pipes as well. That could help you avoid having to create a pipe server that instantiates client pipes, which can lead to extensive protocol coding.
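As a rough illustration of the unnamed-pipe idea, here is a minimal POSIX sketch, not a definitive implementation. The names `do_work` and `run_forked_workers` are hypothetical: `do_work` stands in for whatever each child would really do (for instance, build one kernel and serialize the outcome), and the sketch assumes each child would do its own OpenCL setup after fork(). The parent creates one pipe per child before forking, the child inherits the write end, and the parent reads each result until end-of-file.

```cpp
#include <cstddef>
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for the per-process job; it just formats the index.
std::string do_work(int index) {
    return "result from worker " + std::to_string(index);
}

// Forks `num_workers` children. Each child runs do_work() and writes the
// result into its own unnamed pipe; the parent reads every pipe until EOF.
// Returns one string per worker (empty if pipe() or fork() failed for it).
std::vector<std::string> run_forked_workers(int num_workers) {
    struct Child { pid_t pid; int read_fd; };
    std::vector<Child> children;
    std::vector<std::string> results(num_workers);

    for (int i = 0; i < num_workers; ++i) {
        int fds[2];
        if (pipe(fds) != 0) { children.push_back({-1, -1}); continue; }
        pid_t pid = fork();
        if (pid == 0) {
            // Child: close the inherited read ends, send the result, exit.
            close(fds[0]);
            for (const Child& c : children) {
                if (c.read_fd >= 0) close(c.read_fd);
            }
            std::string out = do_work(i);
            const char* p = out.data();
            std::size_t left = out.size();
            while (left > 0) {  // loop in case of short writes
                ssize_t n = write(fds[1], p, left);
                if (n <= 0) break;
                p += n;
                left -= static_cast<std::size_t>(n);
            }
            close(fds[1]);
            _exit(0);  // skip atexit handlers and static destructors
        }
        close(fds[1]);  // parent keeps only the read end
        if (pid < 0) { close(fds[0]); children.push_back({-1, -1}); continue; }
        children.push_back({pid, fds[0]});
    }

    for (int i = 0; i < num_workers; ++i) {
        if (children[i].read_fd < 0) continue;
        char buf[4096];
        ssize_t n;
        while ((n = read(children[i].read_fd, buf, sizeof buf)) > 0) {
            results[i].append(buf, static_cast<std::size_t>(n));
        }
        close(children[i].read_fd);
        waitpid(children[i].pid, NULL, 0);  // reap the child
    }
    return results;
}
```

The parent here drains the pipes one after another, which keeps the sketch short; a child with a large result simply blocks in write() until the parent reaches its pipe. Reading from whichever child is ready first would need select() or poll() on the read ends.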