I have an algorithm which converts a bayer image channel to RGB. In my implementation I have a single nested for
loop which iterates over the bayer channel, calculates the rgb index from the bayer index and then sets that pixel's value from the bayer channel.
The main thing to notice here is that each pixel can be calculated independently from other pixels (doesn't rely on previous calculations) and so the algorithm is a natural candidate for paralleization. The calculation does however rely on some preset arrays which all threads will be accessing in the same time but will not change.
However, when I tried parallelizing the main for
with MS's cuncurrency::parallel_for
I gained no boost in performance. In fact, for an input of size 3264X2540 running over a 4-core CPU, the non parallelized version ran in ~34ms and the parallelized version ran in ~69ms (averaged over 10 runs). I confirmed that the operation was indeed parallelized (3 new threads were created for the task).
Using Intel's compiler with tbb::parallel_for
gave near exact results.
For comparison, I started out with this algorithm implemented in C#
in which I also used parallel_for
loops and there I encountered near X4 performance gains (I opted for C++
because for this particular task C++
was faster even with a single core).
Any ideas what is preventing my code from parallelizing well?
My code:
template<typename T>
void static ConvertBayerToRgbImageAsIs(T* BayerChannel, T* RgbChannel, int Width, int Height, ColorSpace ColorSpace)
{
//Translates index offset in Bayer image to channel offset in RGB image
int offsets[4];
//calculate offsets according to color space
switch (ColorSpace)
{
case ColorSpace::BGGR:
offsets[0] = 2;
offsets[1] = 1;
offsets[2] = 1;
offsets[3] = 0;
break;
...other color spaces
}
memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
parallel_for(0, Height, [&] (int row)
{
for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
{
auto offset = (row%2)*2 + (col%2); //0...3
auto rgbIndex = bayerIndex * 3 + offsets[offset];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
}
});
}
First of all, your algorithm is memory bandwidth bounded. That is memory load/store would outweigh any index calculations you do.
Vector operations like SSE
/AVX
would not help either - you are not doing any intensive calculations.
Increasing work amount per iteration is also useless - both PPL
and TBB
are smart enough, to not create thread per iteration, they would use some good partition, which would additionaly try to preserve locality. For instance, here is quote from TBB::parallel_for
:
When worker threads are available,
parallel_for
executes iterations is non-deterministic order. Do not rely upon any particular execution order for correctness. However, for efficiency, do expect parallel_for to tend towards operating on consecutive runs of values.
What really matters is to reduce memory operations. Any superfluous traversal over input or output buffer is poison for performance, so you should try to remove your memset
or do it in parallel too.
You are fully traversing input and output data. Even if you skip something in output - that doesn't mater, because memory operations are happening by 64 byte chunks at modern hardware. So, calculate size
of your input and output, measure time
of algorithm, divide size
/time
and compare result with maximal characteristics of your system (for instance, measure with benchmark).
I have made test for Microsoft PPL
, OpenMP
and Native for
, results are (I used 8x of your height):
Native_For 0.21 s
OpenMP_For 0.15 s
Intel_TBB_For 0.15 s
MS_PPL_For 0.15 s
If remove memset
then:
Native_For 0.15 s
OpenMP_For 0.09 s
Intel_TBB_For 0.09 s
MS_PPL_For 0.09 s
As you can see memset
(which is highly optimized) is responsoble for significant amount of execution time, which shows how your algorithm is memory bounded.
FULL SOURCE CODE:
#include <boost/exception/detail/type_info.hpp>
#include <boost/mpl/for_each.hpp>
#include <boost/mpl/vector.hpp>
#include <boost/progress.hpp>
#include <tbb/tbb.h>
#include <iostream>
#include <ostream>
#include <vector>
#include <string>
#include <omp.h>
#include <ppl.h>
using namespace boost;
using namespace std;
const auto Width = 3264;
const auto Height = 2540*8;
struct MS_PPL_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
concurrency::parallel_for(first,last,f);
}
};
struct Intel_TBB_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
tbb::parallel_for(first,last,f);
}
};
struct Native_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
for(; first!=last; ++first) f(first);
}
};
struct OpenMP_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
#pragma omp parallel for
for(auto i=first; i<last; ++i) f(i);
}
};
template<typename T>
struct ConvertBayerToRgbImageAsIs
{
const T* BayerChannel;
T* RgbChannel;
template<typename For>
void operator()(For for_)
{
cout << type_name<For>() << "\t";
progress_timer t;
int offsets[] = {2,1,1,0};
//memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
for_(0, Height, [&] (int row)
{
for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
{
auto offset = (row % 2)*2 + (col % 2); //0...3
auto rgbIndex = bayerIndex * 3 + offsets[offset];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
}
});
}
};
int main()
{
vector<float> bayer(Width*Height);
vector<float> rgb(Width*Height*3);
ConvertBayerToRgbImageAsIs<float> work = {&bayer[0],&rgb[0]};
for(auto i=0;i!=4;++i)
{
mpl::for_each<mpl::vector<Native_For, OpenMP_For,Intel_TBB_For,MS_PPL_For>>(work);
cout << string(16,'_') << endl;
}
}
I would guess that the amount of work done per iteration of the loop is too small. Had you split the image into four parts and ran the computation in parallel, you would have noticed a large gain. Try to design the loop in a way that would case less iterations and more work per iteration. The reasoning behind this is that there is too much synchronization done.
An important factor may be how the data is split (partitioned) for the processing. If the proceessed rows are separated as in the bad case below, then more rows will cause a cache miss. This effect will become more important with each additional thread, because the distance between rows will be greater. If you are certain that the parallelizing function performs reasonable partitioning, then manual work-splitting will not give any results
bad good
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t2
****** t2 ****** t2
****** t1 ****** t2
****** t2 ****** t2
Also make sure that you access your data in the same way it is aligned; it is possible that each call to offset[]
and BayerChannel[]
is a cache miss. Your algorithm is very memory intensive. Almost all operations are either accessing a memory segment or writing to it. Preventing cache misses and minimizing memory access is crucial.
the optimizations shown below may be done by the compiler and may not give better results. It is worth knowing that they can be done.
// is the memset really necessary?
//memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
parallel_for(0, Height, [&] (int row)
{
int rowMod = (row & 1) << 1;
for (auto col = 0, bayerIndex = row * Width, tripleBayerIndex=row*Width*3; col < Width; col+=2, bayerIndex+=2, tripleBayerIndex+=6)
{
auto rgbIndex = tripleBayerIndex + offsets[rowMod];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
//unrolled the loop to save col & 1 operation
rgbIndex = tripleBayerIndex + 3 + offsets[rowMod+1];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex+1];
}
});
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With