 

Weird OpenCL calls side effect on C++ for loop performance

Tags: c++, intel, opencl

I'm working on a C++ project using OpenCL. I'm using the CPU as an OpenCL device with the Intel OpenCL runtime.

I noticed a weird side effect in calling OpenCL functions. Here is a simple test:

#include <iostream>
#include <cstdio>
#include <vector>
#include <CL/cl.hpp>

int main(int argc, char* argv[])
{
  /*
  cl_int status;
  std::vector<cl::Platform> platforms;
  cl::Platform::get(&platforms);
  std::vector<cl::Device> devices;
  platforms[1].getDevices(CL_DEVICE_TYPE_CPU, &devices);
  cl::Context context(devices);
  cl::CommandQueue queue = cl::CommandQueue(context, devices[0]);
  status = queue.finish();
  printf("Status: %d\n", status);
  */

  int ch;
  int b = 0;
  int sum = 0;
  FILE* f1;
  f1 = fopen(argv[1], "r");

  while((ch = fgetc(f1)) != EOF)
  {
    sum += ch;
    b++;
    if(b % 1000000 == 0)
      printf("Char %d read\n", b);
  }
  printf("Sum: %d\n", sum);

}

It's a simple loop that reads a file char by char and adds the values up so the compiler doesn't optimize the loop away.

My system is a Core i7-4770K with a 2TB HDD and 16GB of DDR3, running Ubuntu 14.10. The program above, with a 100MB file as input, takes around 770ms. This is consistent with my HDD speed. So far, so good.

If you now invert the comments and run only the OpenCL calls region, it takes around 200ms. Again, so far, so good.

But if you uncomment everything, the program takes more than 2000ms. I would expect 770ms + 200ms, but it is 2000ms. You can even notice an increased delay between the output messages in the loop. The two regions (the OpenCL calls and the character reading) are supposed to be independent.

I don't understand why using OpenCL interferes with the performance of a simple C++ for loop. It's not simply an OpenCL initialization delay.

I'm compiling this example with:

g++ weird.cpp -O2 -lOpenCL -o weird

I also tried using Clang++, but the same thing happens.

asked Apr 30 '15 by pinguino

2 Answers

This was an interesting one. It happens because getc switches to its thread-safe implementation at the point the command queue is instantiated, so the time increase is the lock grab/release cycle on every character read. I'm not sure why or how this occurs, but that is the decisive point. I reproduced it on the AMD OpenCL SDK with an Intel CPU and was quite amazed to get essentially the same times as the OP.

https://software.intel.com/en-us/forums/topic/337984
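
To make the lock cost concrete, here is a rough illustration (my own sketch, not from the linked thread) of what a thread-safe getc effectively does per character, spelled out with the POSIX stream-lock functions:

#include <cstdio>

int main(int argc, char* argv[])
{
  FILE* f1 = fopen(argv[1], "r");
  int ch;
  int sum = 0;

  for(;;)
  {
    flockfile(f1);            // the lock a thread-safe getc takes implicitly
    ch = getc_unlocked(f1);   // the actual character read
    funlockfile(f1);          // matching release, paid once per character
    if(ch == EOF)
      break;
    sum += ch;
  }
  printf("Sum: %d\n", sum);
  fclose(f1);
}

Once the OpenCL runtime has spawned its threads, every fgetc call pays that grab/release round trip, which is where the extra time goes.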

You can try a remedy for this specific problem by simply changing fgetc to getc_unlocked.

That brought it back down to 930 ms for me; the remaining increase over 750 ms is mostly spent in the platform and context creation lines.
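
For reference, a minimal sketch of the reading loop with that change (note that getc_unlocked is POSIX, not standard C, and is only safe while no other thread touches the stream):

#include <cstdio>

int main(int argc, char* argv[])
{
  int ch;
  int b = 0;
  int sum = 0;
  FILE* f1 = fopen(argv[1], "r");

  // getc_unlocked skips the per-call stream lock, which (per the
  // explanation above) is what becomes expensive once the OpenCL
  // runtime has been initialized.
  while((ch = getc_unlocked(f1)) != EOF)
  {
    sum += ch;
    b++;
    if(b % 1000000 == 0)
      printf("Char %d read\n", b);
  }
  printf("Sum: %d\n", sum);
  fclose(f1);
}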

answered by Jason Newton


I believe the effect is caused by the OpenCL objects still being in scope, and therefore not being destroyed before the loop. They may be affecting the other computation because of the bookkeeping the runtime still has to do while they exist. For example, running the example as you gave it yields the following times on my system (g++ 4.2.1 with -O2 on Mac OS X):

CL:   0.012s
Loop: 14.447s
Both: 14.874s

But putting the OpenCL code into its own anonymous scope, so that the destructors are called automatically before the loop, seems to get rid of the problem. Using the code:

#include <iostream>
#include <cstdio>
#include <vector>
#include "cl.hpp"

int main(int argc, char* argv[])
{

    {
    cl_int status;
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<cl::Device> devices;
    platforms[1].getDevices(CL_DEVICE_TYPE_CPU, &devices);
    cl::Context context(devices);
    cl::CommandQueue queue = cl::CommandQueue(context, devices[0]);
    status = queue.finish();
    printf("Status: %d\n", status);
    }

    int ch;
    int b = 0;
    int sum = 0;
    FILE* f1;
    f1 = fopen(argv[1], "r");
    while((ch = fgetc(f1)) != EOF)
    {
        sum += ch;
        b++;
        if(b % 1000000 == 0)
            printf("Char %d read\n", b);
    }
    printf("Sum: %d\n", sum);
}

I get the timings:

CL:   0.012s
Loop: 14.635s
Both: 14.648s

These seem to add linearly. The effect is pretty small compared to other effects on the system, such as CPU load from other processes, but it seems to be gone once the anonymous scope is added. I'll do some profiling and add it as an edit if it produces anything of interest.

answered by sabreitweiser