I tested parallel_for_
in OpenCV by comparing with the normal operation for just simple array summation and multiplication.
I have array of 100 integers and split into 10 each and run using parallel_for_
.
Then I also have normal 0 to 99 operation for summation and multiuplication.
Then I measured the elapsed time and normal operation is faster than parallel_for_
operation.
My CPU is Intel(R) Core(TM) i7-2600 Quard Core CPU.
parallel_for_
operation took 0.002sec (took 2 clock cycles) for summation and 0.003sec (took 3 clock cycles) for multiplication.
But normal operation took 0.0000sec (less than one click cycle) for both summation and multiplication. What am I missing? My code is as follow.
TEST Class
#include <opencv2\core\internal.hpp>
#include <opencv2\core\core.hpp>
#include <tbb\tbb.h>
using namespace tbb;
using namespace cv;
template <class type>
class Parallel_clipBufferValues:public cv::ParallelLoopBody
{
private:
type *buffertoClip;
type maxSegment;
char typeOperation;//m = mul, s = summation
static double total;
public:
Parallel_clipBufferValues(){ParallelLoopBody::ParallelLoopBody();};
Parallel_clipBufferValues(type *buffertoprocess, const type max, const char op): buffertoClip(buffertoprocess), maxSegment(max), typeOperation(op){
if(typeOperation == 's')
total = 0;
else if(typeOperation == 'm')
total = 1;
}
~Parallel_clipBufferValues(){ParallelLoopBody::~ParallelLoopBody();};
virtual void operator()(const cv::Range &r) const{
double tot = 0;
type *inputOutputBufferPTR = buffertoClip+(r.start*maxSegment);
for(int i = 0; i < 10; ++i)
{
if(typeOperation == 's')
total += *(inputOutputBufferPTR+i);
else if(typeOperation == 'm')
total *= *(inputOutputBufferPTR+i);
}
}
static double getTotal(){return total;}
void normalOperation(){
//int iteration = sizeof(buffertoClip)/sizeof(type);
if(typeOperation == 'm')
{
for(int i = 0; i < 100; ++i)
{
total *= buffertoClip[i];
}
}
else if(typeOperation == 's')
{
for(int i = 0; i < 100; ++i)
{
total += buffertoClip[i];
}
}
}
};
MAIN
#include "stdafx.h"
#include "TestClass.h"
#include <ctime>
double Parallel_clipBufferValues<int>::total;
int _tmain(int argc, _TCHAR* argv[])
{
const int SIZE=100;
int myTab[SIZE];
double totalSum_by_parallel;
double totalSun_by_normaloperation;
double elapsed_secs_parallel;
double elapsed_secs_normal;
for(int i = 1; i <= SIZE; i++)
{
myTab[i-1] = i;
}
int maxSeg =10;
clock_t begin_parallel = clock();
cv::parallel_for_(cv::Range(0,maxSeg), Parallel_clipBufferValues<int>(myTab, maxSeg, 'm'));
totalSum_by_parallel = Parallel_clipBufferValues<int>::getTotal();
clock_t end_parallel = clock();
elapsed_secs_parallel = double(end_parallel - begin_parallel) / CLOCKS_PER_SEC;
clock_t begin_normal = clock();
Parallel_clipBufferValues<int> norm_op(myTab, maxSeg, 'm');
norm_op.normalOperation();
totalSun_by_normaloperation = norm_op.getTotal();
clock_t end_normal = clock();
elapsed_secs_normal = double(end_normal - begin_normal) / CLOCKS_PER_SEC;
return 0;
}
Let me do some considerations:
clock()
function is not accurate at all. Its tick is roughly 1 / CLOCKS_PER_SEC
but how often it's updated and if it's uniform or not it's system and implementation dependent. See this post for more details about that.
Better alternatives to measure time:
Measures are always affected by errors. Performance measurement for your code is affected (short list, there is much more than that) by other programs, cache, operating system jobs, scheduling and user activity. To have a better measure you have to repeat it many times (let's say 1000 or more) then calculate average. Moreover you should prepare your test environment to be as clean as possible.
More details about tests on these posts:
In your case overhead for parallel execution (and your test code structure) is much higher that loop body itself. In this case it's not productive to make an algorithm parallel. Parallel execution must always be evaluated in a specific scenario, measured and compared. It's not kind of magic medicine to speed up everything. Take a look to this article about How to Quantify Scalability.
Just for example if you have to sum/multiply 100 numbers it's better to use SIMD instructions (even better within an unrolled loop).
Try to make your loop body empty (or to execute a single NOP
operation or volatile
write so it won't be optimized away). You'll roughly measure overhead. Now compare it with your results.
IMO this kind of test is pretty useless. You can't compare, in a generic way, serial or parallel execution. It's something you should always check against a specific situation (in real world many things will play, synchronization for example).
Imagine: you make your loop body really "heavy" and you'll see a big speed up with parallel execution. Now you make your real program parallel and you see performance is worse. Why? Because parallel execution is slowed down by locks, by cache problems or serial access to a shared resource.
Test itself is meaningless unless you're testing your specific code in your specific situation (because too many factors will play and you just can't ignore them). What it means? Well that you can compare only what you tested...if your program performs total *= buffertoClip[i];
then your results are reliable. If your real program does something else then you have to repeat tests with that something else.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With