
(opencv rc1) What causes Mat multiplication to be 20x slower than per-pixel multiplication?

// 700 ms
cv::Mat in(height,width,CV_8UC1);
in /= 4;

Replaced with

//40 ms
cv::Mat in(height,width,CV_8UC1);
for (int y=0; y < in.rows; ++y)
{
    unsigned char* ptr = in.data + y*in.step1();
    for (int x=0; x < in.cols; ++x)
    {
        ptr[x] /= 4;
    }
}

What can cause such behavior? Is it due to OpenCV "promoting" a Mat-by-Scalar multiplication to a Mat-by-Mat multiplication, or is it a specific failed optimization for ARM? (NEON is enabled.)

Boyko Perfanov asked May 11 '15 11:05



2 Answers

This is a very old issue (I reported it a couple of years ago): many basic operations take extra time. Not just division but also addition, abs, etc. I don't know the real reason for this behavior. What is even weirder is that operations that are supposed to take more time, like addWeighted, are actually very efficient. Try this one:

addWeighted(in, 1.0/4, in, 0, 0, in);

It performs multiple operations per pixel, yet it runs several times faster than both the add function and the loop implementation.
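To make the addWeighted call concrete, here is a minimal sketch of the per-pixel arithmetic it is documented to perform, dst = saturate(src1*alpha + src2*beta + gamma), written in plain C++ without OpenCV. The names saturate_u8 and add_weighted_model are hypothetical, and using std::lrint for the rounding step is an assumption about how the 8-bit result is rounded. It models the values only, not OpenCV's actual implementation or its speed.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: round to the nearest integer, then clamp to [0, 255].
// This imitates a saturating cast from double to an 8-bit pixel.
inline std::uint8_t saturate_u8(double v)
{
    long r = std::lrint(v);
    return static_cast<std::uint8_t>(std::min(255L, std::max(0L, r)));
}

// Hypothetical model of the weighted sum: dst[i] = a[i]*alpha + b[i]*beta + gamma,
// computed in double precision and saturated back to 8 bits.
inline std::vector<std::uint8_t> add_weighted_model(const std::vector<std::uint8_t>& a,
                                                    double alpha,
                                                    const std::vector<std::uint8_t>& b,
                                                    double beta,
                                                    double gamma)
{
    assert(a.size() == b.size()); // both inputs must have the same number of pixels
    std::vector<std::uint8_t> dst(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        dst[i] = saturate_u8(a[i] * alpha + b[i] * beta + gamma);
    }
    return dst;
}
```

In this model, calling add_weighted_model(px, 0.25, px, 0.0, 0.0) maps a pixel value of 255 to 64, because 63.75 is rounded to the nearest integer rather than truncated.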

Here is my report on the bug tracker.

Michael Burdinov answered Oct 10 '22 02:10


I tried the same by measuring CPU time:

#include <ctime>
#include <iostream>
#include <opencv2/opencv.hpp>

int main()
{
    clock_t startTime;
    clock_t endTime;

    int height =1024;
    int width =1024;

    // Mat arithmetic (reported as 700 ms in the question)
    cv::Mat in(height,width,CV_8UC1, cv::Scalar(255));
    std::cout << "value: " << (int)in.at<unsigned char>(0,0) << std::endl;

    cv::Mat out(height,width,CV_8UC1);

    startTime = clock();
    out = in/4;
    endTime = clock();
    std::cout << "1: " << (float)(endTime-startTime)/(float)CLOCKS_PER_SEC << std::endl;
    std::cout << "value: " << (int)out.at<unsigned char>(0,0) << std::endl;


    startTime = clock();
    in /= 4;
    endTime = clock();
    std::cout << "2: " <<  (float)(endTime-startTime)/(float)CLOCKS_PER_SEC << std::endl;
    std::cout << "value: " << (int)in.at<unsigned char>(0,0) << std::endl;

    // Manual per-pixel loop (reported as 40 ms in the question)
    cv::Mat in2(height,width,CV_8UC1, cv::Scalar(255));

    startTime = clock();
    for (int y=0; y < in2.rows; ++y)
    {
        //unsigned char* ptr = in2.data + y*in2.step1();
        unsigned char* ptr = in2.ptr(y);
        for (int x=0; x < in2.cols; ++x)
        {
            ptr[x] /= 4;
        }
    }
    endTime = clock(); // stop the timer before printing so console I/O is not measured
    std::cout << "3: " <<  (float)(endTime-startTime)/(float)CLOCKS_PER_SEC << std::endl;
    std::cout << "value: " << (int)in2.at<unsigned char>(0,0) << std::endl;


    cv::namedWindow("...");
    cv::waitKey(0);
}

with results:

value: 255
1: 0.016
value: 64
2: 0.016
value: 64
3: 0.003
value: 63

You can see that the results differ, probably because OpenCV's Mat division performs floating-point division and rounds to the nearest integer (255 / 4 = 63.75, which rounds to 64). Your faster version uses integer division, which is faster but truncates and so gives a different result (63).

In addition, there is a saturate_cast in the OpenCV computation, but I guess the bigger difference in computational load comes from the double-precision division.
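The rounding difference can be reproduced without OpenCV. The sketch below is a model under assumptions, not OpenCV's code: divide_rounded (a hypothetical name) divides in double precision, rounds to nearest with std::lrint, and clamps to 0..255 to imitate a saturating cast, while divide_truncated (also hypothetical) is the plain integer division used in the hand-written loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumed model of the Mat path: double-precision division,
// round to nearest, then clamp into the 8-bit range.
inline std::uint8_t divide_rounded(std::uint8_t value, double divisor)
{
    assert(divisor != 0.0);
    long r = std::lrint(value / divisor);
    return static_cast<std::uint8_t>(std::min(255L, std::max(0L, r)));
}

// The manual loop: integer division, which discards the fractional part.
// Intended for positive divisors only.
inline std::uint8_t divide_truncated(std::uint8_t value, int divisor)
{
    assert(divisor > 0);
    return static_cast<std::uint8_t>(value / divisor);
}
```

For a pixel value of 255 and a divisor of 4, divide_rounded yields 64 and divide_truncated yields 63, which matches the values printed in the results. The two functions agree whenever the value is an exact multiple of the divisor.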

Micka answered Oct 10 '22 02:10