
OpenCV C++ multithreading speedups

For the following code, here is a bit of context.

Mat img0; // 1280x960 grayscale

--

timer.start();
for (int i = 0; i < img0.rows; i++)
{
    vector<double> v;
    uchar* p = img0.ptr<uchar>(i);
    for (int j = 0; j < img0.cols; ++j)
    {
        v.push_back(p[j]);
    }
}
cout << "Single thread " << timer.end() << endl;

and

timer.start();
concurrency::parallel_for(0, img0.rows, [&img0](int i) {
    vector<double> v;
    uchar* p = img0.ptr<uchar>(i);
    for (int j = 0; j < img0.cols; ++j)
    {
        v.push_back(p[j]);
    }
});
cout << "Multi thread " << timer.end() << endl;

The result:

Single thread 0.0458856
Multi thread 0.0329856

The speedup is small: only about 1.4x.

My processor is an Intel i5 at 3.10 GHz, with 8 GB of DDR3 RAM.

EDIT

I also tried a slightly different approach.

// `split` is my custom function; in this case it splits `img0` into two images, its left and right halves
vector<Mat> imgs = split(img0, 2, 1);

--

timer.start();
concurrency::parallel_for(0, (int)imgs.size(), [imgs](int i) {
    Mat img = imgs[i];
    vector<double> v;
    for (int row = 0; row < img.rows; row++)
    {
        uchar* p = img.ptr<uchar>(row);
        for (int col = 0; col < img.cols; ++col)
        {
            v.push_back(p[col]);
        }
    }

});
cout << " Multi thread Sectored " << timer.end() << endl;

And I get a much better result:

Multi thread Sectored 0.0232881

So, it looks like I was creating 960 threads or something when I ran

parallel_for(0, img0.rows, ...

And that didn't work well.
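For illustration, the same "sectored" idea can be sketched without a custom split by giving each worker one contiguous band of rows. This is only a sketch under stated assumptions, not the code from the question: it uses std::thread in place of concurrency::parallel_for, a plain row-major std::vector<unsigned char> in place of cv::Mat, and the names row_bands and sum_pixels_banded are made up for the example. It sums the pixels instead of filling a vector so that the result is easy to check.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <utility>
#include <vector>

// Split [0, rows) into at most `bands` contiguous, near-equal ranges [begin, end).
std::vector<std::pair<int, int>> row_bands(int rows, int bands)
{
    std::vector<std::pair<int, int>> out;
    if (rows <= 0 || bands <= 0) return out;
    bands = std::min(bands, rows);
    const int base = rows / bands;
    const int extra = rows % bands; // the first `extra` bands get one more row
    int begin = 0;
    for (int b = 0; b < bands; ++b) {
        const int end = begin + base + (b < extra ? 1 : 0);
        out.emplace_back(begin, end);
        begin = end;
    }
    return out;
}

// Sum all pixels of a row-major 8-bit image, using one thread per band of rows.
std::uint64_t sum_pixels_banded(const std::vector<unsigned char>& pixels,
                                int rows, int cols, int bands)
{
    assert(rows >= 0 && cols >= 0 &&
           pixels.size() == static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols));
    const auto ranges = row_bands(rows, bands);
    std::vector<std::uint64_t> partial(ranges.size(), 0);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < ranges.size(); ++t) {
        workers.emplace_back([&, t] {
            std::uint64_t local = 0; // accumulate locally, write the shared slot once
            for (int r = ranges[t].first; r < ranges[t].second; ++r) {
                const unsigned char* p = pixels.data() + static_cast<std::size_t>(r) * cols;
                for (int c = 0; c < cols; ++c) local += p[c];
            }
            partial[t] = local;
        });
    }
    for (auto& w : workers) w.join();
    std::uint64_t total = 0;
    for (auto v : partial) total += v;
    return total;
}
```

std::thread::hardware_concurrency() can serve as a hint for the number of bands, although it is allowed to return 0. With OpenCV, a band could presumably be taken as img0.rowRange(begin, end), which creates a header over the same pixel data rather than copying it.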

(I must add that Kenney's comment is correct. Do not read too much into the specific numbers I stated here. When measuring intervals as short as these, the variation between runs is high. But in general, what I wrote in the edit about splitting the image in half improved performance compared with the old approach.)
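Because single runs of a few tens of milliseconds vary so much, one common way to get steadier numbers is to repeat the measurement and keep the fastest run. The sketch below is an assumption about what such a harness could look like: Timer only mimics the timer.start() / timer.end() calls used above (the original class is not shown), and best_of is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical stand-in for the `timer` object used in the question.
class Timer {
public:
    void start() { begin_ = std::chrono::steady_clock::now(); }

    // Seconds elapsed since the last call to start().
    double end() const
    {
        const std::chrono::duration<double> d = std::chrono::steady_clock::now() - begin_;
        return d.count();
    }

private:
    std::chrono::steady_clock::time_point begin_{};
};

// Run `work` `repeats` times and return the shortest wall-clock time in seconds.
template <typename Work>
double best_of(int repeats, Work work)
{
    assert(repeats > 0);
    Timer timer;
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < repeats; ++i) {
        timer.start();
        work();
        best = std::min(best, timer.end());
    }
    return best;
}
```

A call such as best_of(20, [&] { /* loop under test */ }) would then replace a single start/end pair. The work should produce a value that is actually used afterwards, since a compiler may otherwise remove a loop whose result is never read.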

asked Nov 08 '22 23:11 by ancajic



1 Answer

I think your problem is that you are limited by memory bandwidth. Your second snippet basically reads the whole image, and all of that has to come out of main memory into cache (or out of L2 cache into L1 cache).
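One way to test the bandwidth hypothesis is to estimate the effective read rate from the image size and the measured time, and then compare it with what the machine's memory can sustain. The helper below is a hypothetical sketch (pixel_read_rate is not an OpenCV function), and it counts only reads of the source pixels, not the writes into the vector<double>.

```cpp
#include <cassert>

// Bytes of source pixel data read per second, given image size and elapsed time.
// Only the reads of the image are counted; writes to any output are ignored.
double pixel_read_rate(int rows, int cols, int bytes_per_pixel, double seconds)
{
    assert(rows > 0 && cols > 0 && bytes_per_pixel > 0 && seconds > 0.0);
    const double bytes = static_cast<double>(rows) * cols * bytes_per_pixel;
    return bytes / seconds;
}
```

For the 1280x960 grayscale image (1,228,800 bytes) and the multi-threaded time of 0.0329856 s reported in the question, pixel_read_rate(960, 1280, 1, 0.0329856) comes out at roughly 37 MB/s. Whether that is anywhere near the hardware limit depends on the machine.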

You need to arrange your code so that all four cores are working on the same bit of memory at once (I presume you are not actually trying to optimize this code - it is just a simple example).

Edit: Insert crucial "not" in last parenthetical remark.

answered Nov 15 '22 11:11 by Martin Bonner supports Monica
