Is there a way to update a maximum from multiple threads using atomic operations?
Illustrative example:
std::vector<float> coord_max(128);
#pragma omp parallel for
for (int i = 0; i < limit; ++i) {
int j = get_coord(i); // can return any value in range [0,128)
float x = compute_value(j, i);
#pragma omp critical (coord_max_update)
coord_max[j] = std::max(coord_max[j], x);
}
In the above case, the critical section synchronizes access to the entire vector, whereas we only need to synchronize access to each of the values independently.
Following a suggestion in a comment, I found a solution that does not require locking and instead uses the compare-and-exchange functionality found in std::atomic / boost::atomic. I am limited to C++03 so I would use boost::atomic in this case.
BOOST_STATIC_ASSERT(sizeof(int) == sizeof(float));
union FloatPun { float f; int i; };
std::vector< boost::atomic<int> > coord_max(128);
#pragma omp parallel for
for (int i = 0; i < limit; ++i) {
int j = get_coord(i);
FloatPun x, maxval;
x.f = compute_value(j, i);
maxval.i = coord_max[j].load(boost::memory_order_relaxed);
do {
if (maxval.f >= x.f) break;
} while (!coord_max[j].compare_exchange_weak(maxval.i, x.i,
boost::memory_order_relaxed));
}
There is some boilerplate involved in putting float values in ints, since it seems that atomic floats are not lock-free. I am not 100% use about the memory order, but the least restrictive level 'relaxed' seems to be OK, since non-atomic memory is not involved.
Not sure about the syntax, but algorithmically, you have three choices:
Lock down the entire vector to guarantee atomic access (which is what you are currently doing).
Lock down individual elements, so that every element can be updated independent of others. Pros: maximum parallelism; Cons: lots of locks required!
Something in-between! Conceptually think of partitioning your vector into 16 (or 32/64/...) "banks" as follows: bank0 consists of vector elements 0, 16, 32, 48, 64, ... bank1 consists of vector elements 1, 17, 33, 49, 65, ... bank2 consists of vector elements 2, 18, 34, 50, 66, ... ... Now, use 16 explicit locks before you access the element and you can have upto 16-way parallelism. To access element n, acquire lock (n%16), finish the access, then release the same lock.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With