Testing parallel_for_ performance in OpenCV

Question

I tested parallel_for_ in OpenCV by comparing with the normal operation for just simple array summation and multiplication.

I have array of 100 integers and split into 10 each and run using parallel_for_.

Then I also have normal 0 to 99 operation for summation and multiuplication.

Then I measured the elapsed time and normal operation is faster than parallel_for_ operation.

My CPU is Intel(R) Core(TM) i7-2600 Quard Core CPU. parallel_for_ operation took 0.002sec (took 2 clock cycles) for summation and 0.003sec (took 3 clock cycles) for multiplication.

But normal operation took 0.0000sec (less than one click cycle) for both summation and multiplication. What am I missing? My code is as follow.

TEST Class

#include <opencv2\core\internal.hpp>
#include <opencv2\core\core.hpp>
#include <tbb	bb.h>
using namespace tbb;
using namespace cv;

template <class type>
class Parallel_clipBufferValues:public cv::ParallelLoopBody
{
   private:
       type *buffertoClip;
       type maxSegment;

       char typeOperation;//m = mul, s = summation
       static double total;
   public:
       Parallel_clipBufferValues(){ParallelLoopBody::ParallelLoopBody();};
       Parallel_clipBufferValues(type *buffertoprocess, const type max, const char op): buffertoClip(buffertoprocess), maxSegment(max), typeOperation(op){ 
           if(typeOperation == 's')
                total = 0; 
           else if(typeOperation == 'm')
                total = 1; 
       }
       ~Parallel_clipBufferValues(){ParallelLoopBody::~ParallelLoopBody();};

       virtual void operator()(const cv::Range &r) const{
           double tot = 0;        
           type *inputOutputBufferPTR = buffertoClip+(r.start*maxSegment);
           for(int i = 0; i < 10; ++i)
           {
               if(typeOperation == 's')
                  total += *(inputOutputBufferPTR+i);
               else if(typeOperation == 'm')
                  total *= *(inputOutputBufferPTR+i);
           }

       }

       static double getTotal(){return total;}

       void normalOperation(){
           //int iteration = sizeof(buffertoClip)/sizeof(type);
           if(typeOperation == 'm')
           {
               for(int i = 0; i < 100; ++i)
               {
                  total *= buffertoClip[i];
               }
           }
           else if(typeOperation == 's')
           {
               for(int i = 0; i < 100; ++i)
               {
                  total += buffertoClip[i];
               }
           }
       }

};

MAIN

    #include "stdafx.h"
    #include "TestClass.h"
    #include <ctime>

    double Parallel_clipBufferValues<int>::total;
    int _tmain(int argc, _TCHAR* argv[])
    {
        const int SIZE=100;
        int myTab[SIZE];
        double totalSum_by_parallel;
        double totalSun_by_normaloperation;
        double elapsed_secs_parallel;
        double elapsed_secs_normal;
        for(int i = 1; i <= SIZE; i++)
        {
            myTab[i-1] = i;
        }
        int maxSeg =10;
        clock_t begin_parallel = clock();
        cv::parallel_for_(cv::Range(0,maxSeg), Parallel_clipBufferValues<int>(myTab, maxSeg, 'm'));
        totalSum_by_parallel = Parallel_clipBufferValues<int>::getTotal();
        clock_t end_parallel = clock();
        elapsed_secs_parallel = double(end_parallel - begin_parallel) / CLOCKS_PER_SEC;

        clock_t begin_normal = clock();
        Parallel_clipBufferValues<int> norm_op(myTab, maxSeg, 'm');
        norm_op.normalOperation();
        totalSun_by_normaloperation = norm_op.getTotal();
        clock_t end_normal = clock();
        elapsed_secs_normal = double(end_normal - begin_normal) / CLOCKS_PER_SEC;
        return 0;
    }

Adriano Repetti · Accepted Answer

Let me do some considerations:

Accuracy

clock() function is not accurate at all. Its tick is roughly 1 / CLOCKS_PER_SEC but how often it's updated and if it's uniform or not it's system and implementation dependent. See this post for more details about that.

Better alternatives to measure time:

This post for Windows.
This article for *nix.

Trials and Test Environment

Measures are always affected by errors. Performance measurement for your code is affected (short list, there is much more than that) by other programs, cache, operating system jobs, scheduling and user activity. To have a better measure you have to repeat it many times (let's say 1000 or more) then calculate average. Moreover you should prepare your test environment to be as clean as possible.

More details about tests on these posts:

How do I write a correct micro-benchmark in Java?
NAS Parallel Benchmarks
Visual C++ 11 Beta Benchmark of Parallel Loops (for code examples)
Great articles from our Eric Lippert about benchmarking (it's about C# but most of them applies directly to any bechmark): C# Performance Benchmark Mistakes (part II).

Overhead and Scalability

In your case overhead for parallel execution (and your test code structure) is much higher that loop body itself. In this case it's not productive to make an algorithm parallel. Parallel execution must always be evaluated in a specific scenario, measured and compared. It's not kind of magic medicine to speed up everything. Take a look to this article about How to Quantify Scalability.

Just for example if you have to sum/multiply 100 numbers it's better to use SIMD instructions (even better within an unrolled loop).

Measure It!

Try to make your loop body empty (or to execute a single NOP operation or volatile write so it won't be optimized away). You'll roughly measure overhead. Now compare it with your results.

Notes About This Test

IMO this kind of test is pretty useless. You can't compare, in a generic way, serial or parallel execution. It's something you should always check against a specific situation (in real world many things will play, synchronization for example).

Imagine: you make your loop body really "heavy" and you'll see a big speed up with parallel execution. Now you make your real program parallel and you see performance is worse. Why? Because parallel execution is slowed down by locks, by cache problems or serial access to a shared resource.

Test itself is meaningless unless you're testing your specific code in your specific situation (because too many factors will play and you just can't ignore them). What it means? Well that you can compare only what you tested...if your program performs total *= buffertoClip[i]; then your results are reliable. If your real program does something else then you have to repeat tests with that something else.

Testing parallel_for_ performance in OpenCV

Tags:

c++

performance

opencv

parallel-processing

batuman

1 Answers

Accuracy

Trials and Test Environment

Overhead and Scalability

Measure It!

Notes About This Test

Adriano Repetti

Recent Activity

Donate For Us

Testing parallel_for_ performance in OpenCV

Tags:

c++

performance

opencv

parallel-processing

batuman

1 Answers

Accuracy

Trials and Test Environment

Overhead and Scalability

Measure It!

Notes About This Test

Adriano Repetti

Related questions

Recent Activity

Donate For Us