TL;DR: How can I safely perform a single-bit update A[n/8] |= (1<<n%8); when A is a huge array of chars (i.e., set the n-th bit of A to true) while computing in parallel with C++11's <thread> library?
I'm performing a computation that's easy to parallelize. I'm computing elements of a certain subset of the natural numbers, and I want to find elements that are not in the subset. For this I create a huge array (like A = new char[20l*1024l*1024l*1024l], i.e., 20 GiB). The n-th bit of this array is true if n lies in my set.
When doing it in parallel and setting the bits to true using A[n/8] |= (1<<n%8);, I seem to get a small loss of information, presumably due to concurrent work on the same byte of A (each thread has to first read the byte, update the single bit, and write the byte back). How can I get around this? Is there a way to do this update as an atomic operation?
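To spell out where I think the information gets lost (a minimal sketch of my understanding, not code I actually run), the |= update is really a three-step read-modify-write:

char tmp = A[n/8];       // 1. read the whole byte
tmp |= (1 << (n%8));     // 2. set one bit in the local copy
A[n/8] = tmp;            // 3. write the byte back
// If two threads interleave these steps on the same byte, the
// later write-back clobbers the bit the other thread just set.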
The code follows. GCC version: g++ (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
. The machine is an 8-core Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 37GB RAM. Compiler options: g++ -std=c++11 -pthread -O3
#include <iostream>
#include <thread>
typedef long long myint; // long long to be sure
const myint max_A = 20ll*1024ll*1024ll; // 20 MiB for testing
//const myint max_A = 20ll*1024ll*1024ll*1024ll; // 20 GiB in the real code
const myint n_threads = 1; // Number of threads
const myint prime = 1543; // Tested prime
char *A;
const myint max_n = 8*max_A;
inline char getA(myint n) { return A[n/8] & (1<<(n%8)); }
inline void setAtrue(myint n) { A[n/8] |= (1<<(n%8)); }
void run_thread(myint startpoint) {
    // Calculate all values of x^2 + 2y^2 + prime*z^2 up to max_n
    // We loop through x == startpoint (mod n_threads)
    for(myint x = startpoint; 1*x*x < max_n; x+=n_threads)
        for(myint y = 0; 1*x*x + 2*y*y < max_n; y++)
            for(myint z = 0; 1*x*x + 2*y*y + prime*z*z < max_n; z++)
                setAtrue(1*x*x + 2*y*y + prime*z*z);
}
int main() {
    myint n;
    // Only n_threads-1 threads, as we will use the master thread as well
    std::thread T[n_threads-1];
    // Initialize the array
    A = new char[max_A]();
    // Start the threads
    for(n = 0; n < n_threads-1; n++) T[n] = std::thread(run_thread, n);
    // We use also the master thread
    run_thread(n_threads-1);
    // Synchronize
    for(n = 0; n < n_threads-1; n++) T[n].join();
    // Print and count all elements not in the set with n != 0 (mod prime)
    myint cnt = 0;
    for(n = 0; n < max_n; n++) if(( !getA(n) )&&( n % prime != 0 )) {
        std::cout << n << std::endl;
        cnt++;
    }
    std::cout << "cnt = " << cnt << std::endl;
    return 0;
}
When n_threads = 1, I get the correct value cnt = 29289. When n_threads = 7, I got cnt = 29314 and cnt = 29321 on two different calls, suggesting that some of the bitwise operations on the same byte did indeed collide.
std::atomic provides all the facilities that you need here:
std::array<std::atomic<char>, max_A> A;
static_assert(sizeof(A[0]) == 1, "Shall not have memory overhead");
static_assert(std::atomic<char>::is_always_lock_free,
              "No software-level locking needed on common platforms");

inline char getA(myint n) { return A[n / 8] & (1 << (n % 8)); }
inline void setAtrue(myint n) { A[n / 8].fetch_or(1 << (n % 8)); }
The load in getA is atomic (equivalent to load()), and std::atomic even has built-in support for or-ing the stored value with another one (fetch_or), atomically of course. Note that is_always_lock_free is a C++17 addition; under plain C++11 you can check A[0].is_lock_free() at run time instead.
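If only the final contents of the array matter (a plausible assumption for this use case, though not something the question states), you can also pass a relaxed memory order to fetch_or. The read-modify-write stays atomic; it just carries no ordering guarantees:

inline void setAtrue(myint n) {
    // Still an atomic RMW; relaxed only means it is not ordered
    // against surrounding memory operations. On x86 this compiles
    // to the same lock-prefixed instruction as the default order.
    A[n / 8].fetch_or(1 << (n % 8), std::memory_order_relaxed);
}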
When initializing A, the naive way of for (auto& a : A) a = 0; would emit a fully synchronizing (sequentially consistent) store for every element, which you can avoid by relaxing the ordering. std::memory_order_release only requires that what we write becomes visible to other threads (but not that other threads' writes are visible to us). And indeed, if you do
// Initialize the array
for (auto& a : A)
    a.store(0, std::memory_order_release);
you get the safety you need without any assembly-level synchronization on x86. You could do the reverse (acquire loads) for the reads after the threads finish, but that has no added benefit on x86 (a load is just a plain mov either way).
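For completeness, the reverse would look like this (a sketch; as noted, it buys nothing on x86):

inline char getA(myint n) {
    // Acquire-ordered load; on x86 every load is a plain mov, so
    // this costs the same as the default seq_cst load.
    return A[n / 8].load(std::memory_order_acquire) & (1 << (n % 8));
}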
Demo on the full code: https://godbolt.org/z/nLPlv1
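One caveat for the real 20 GiB case (my addition, not covered by the demo): a std::array with static storage duration that large is impractical on most setups, but the same approach works with heap-allocated atomics:

// Heap-allocated variant (sketch, my assumption): same interface as
// the std::array version, but the storage comes from the free store.
// Initialize it with the release-store loop shown above.
std::atomic<char>* A = new std::atomic<char>[max_A];
for (myint i = 0; i < max_A; i++)
    A[i].store(0, std::memory_order_release);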