I want to parallelize my search algorithm using OpenMP. vTree
is a binary search tree, and I want to run my search algorithm for each point in a point set. Below is a snippet of my code. The searches for two different points are completely independent, so they can run in parallel. They do read the same tree, but once constructed, the tree is never modified; it is effectively read-only.
However, the code below shows terrible scalability: on my 32-core platform, only a 2x speedup is achieved. Is it because vTree
is read by all threads? If so, how can I further optimize the code?
auto results = vector<vector<Point>>(particleNum);
auto t3 = high_resolution_clock::now();
double radius = 1.6;
#pragma omp parallel for
for (decltype(points.size()) i = 0; i < points.size(); i++)
{
    vTree.search(points[i], radius, results[i]);
}
auto t4 = high_resolution_clock::now();
double searchTime = duration_cast<duration<double>>(t4 - t3).count();
The type signature of search is:
void VPTree::search(const Point& p, double radius, vector<Point>& result) const
The search results are placed into result.
My best guess would be that you are suffering from false sharing (cache ping-ponging) on the result vectors. I assume that your search function uses the passed-in result vector as the place to put points, inserting neighbors into it throughout the algorithm as you encounter them in the search tree. Whenever you add a point to that result vector, the internal data of the vector object (its size and pointer members) is modified. And because all of your result vectors are packed together in contiguous memory, it is likely that different result vectors occupy the same cache lines. So, as the CPU maintains cache coherence, it will constantly bounce ownership of the relevant cache lines between cores.
The way to solve it is to use a local, temporary vector that you assign to the result vector only once, at the end (which is cheap if you use move semantics). Something like this:
void VPTree::search(const Point& p, double radius, vector<Point>& result) const {
    vector<Point> tmp_result;
    // ... add results to "tmp_result"
    result = std::move(tmp_result);  // a single cheap write to the shared vector
}
Or, you could just return the vector by value, which will use a move (or named return value optimization, NRVO):
vector<Point> VPTree::search(const Point& p, double radius) const {
    vector<Point> result;
    // ... add results to "result"
    return result;
}
Welcome to the joyful world of move semantics and how effective it can be at solving these kinds of concurrency / cache-coherence issues.
It is also conceivable that you're experiencing problems related to accessing the same tree from all threads, but since these are all read-only operations, I'm pretty sure that even on a conservative architecture like x86 (and other Intel / AMD CPUs) this should not pose a significant issue; I might be wrong (maybe a kind of "over-subscription" problem is at play, but it's dubious).
Other possible problems include the fact that OpenMP incurs quite a bit of overhead (spawning threads, synchronizing, etc.), which has to be weighed against the computational cost of the actual work done inside the parallel loop (and it's not always a favorable trade-off).
And also, if your VPTree (which I imagine stands for "Vantage-point Tree") does not have good locality of reference (e.g., you implemented it as a pointer-linked tree), then the performance is going to be terrible whichever way you use it (as I explain here).