Consider a bit vector of N bits (N is large) and an array of M numbers (M is moderate, usually much smaller than N), each in the range 0..N-1, indicating which bit of the vector must be set to 1. The latter array is not sorted. The bit vector is just an array of integers, specifically `__m256i`, where 256 bits are packed into each `__m256i` structure.

How can this work be split efficiently across multiple threads?

Preferred language is C++ (MSVC++2017 toolset v141); assembly is also great. Preferred CPU is x86_64 (intrinsics are ok). AVX2 is desired, if there is any benefit from it.
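For concreteness, here is a minimal single-threaded baseline under the assumptions above (the function name and the choice of `uint32_t` indices are mine, not part of the question):

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Single-threaded reference version: view the __m256i-backed bit vector as bytes
// (uint8_t is unsigned char on mainstream compilers, so this aliasing is fine)
// and set one bit per entry of M.
void set_bits_single(__m256i *vec, const uint32_t *M, size_t len)
{
    uint8_t *bytes = reinterpret_cast<uint8_t *>(vec);
    for (size_t i = 0; i < len; ++i) {
        uint32_t idx = M[i];
        bytes[idx >> 3] |= uint8_t(1u << (idx & 7));   // set bit number idx
    }
}
```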
@IraBaxter posted an interesting but flawed idea which can be made to work (at significant cost). I suspect @BeeOnRope's idea of partial-sort / partitioning the M array will perform better (especially for CPUs with large private caches which can keep parts of N hot). I'll summarize the modified version of Ira's idea that I described in comments on his deleted answer. (That answer has some suggestions about how big N has to be before it's worth multi-threading.)
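For reference, a rough sketch of that partitioning idea, under the assumption that each thread only ever touches a disjoint, cache-line-aligned slice of N, so no atomics or read-back checks are needed at all (a real implementation would bucket M in a pre-pass instead of having every thread re-scan it); the rest of this answer describes the unsorted-chunk approach:

```cpp
#include <cstdint>
#include <cstddef>

// Each thread owns bytes [byte_begin, byte_end) of N (ranges should be 64B-aligned
// to avoid false sharing) and skips indices that fall outside its slice.
void set_bits_partitioned(uint8_t *N, const unsigned *M, size_t len,
                          size_t byte_begin, size_t byte_end)
{
    for (size_t i = 0; i < len; ++i) {
        unsigned idx = M[i];
        size_t byte = idx >> 3;
        if (byte >= byte_begin && byte < byte_end)
            N[byte] |= 1u << (idx & 7);   // exclusive owner of this byte range
    }
}
```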
Each writer thread gets a chunk of M with no sorting/partitioning.
The idea is that conflicts are very rare because N is large compared to the number of stores that can be in flight at once. Since setting a bit is idempotent, we can handle conflicts (where two threads want to set different bits in the same byte) by checking the value in memory to make sure it really does have the bit set that we want after an RMW operation like `or [N + rdi], al` (with no `lock` prefix).
E.g. thread 1 tried to store `0x1` and stepped on thread 2's store of `0x2`. Thread 2 must notice and retry the read-modify-write (probably with `lock or` to keep it simple and make further retries impossible) to end up with `0x3` in the conflict byte.
We need an `mfence` instruction before the read-back. Otherwise store-forwarding will give us the value we just wrote before other threads see our store. In other words, a thread can observe its own stores earlier than they appear in the global order. x86 does have a Total Order for stores, but not for loads. Thus, we need `mfence` to prevent StoreLoad reordering. (Intel's "Loads Are not Reordered with Older Stores to the Same Location" guarantee is not as useful as it sounds: a store/reload isn't a memory barrier; it's just saying that out-of-order execution preserves program-order semantics.)
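A minimal per-element sketch of that store / fence / read-back pattern (the batched function later in this answer amortizes the fence; the helper name here is made up):

```cpp
#include <emmintrin.h>   // _mm_mfence
#include <cstdint>

// Plain byte OR, drain the store buffer, then check whether our bit survived;
// retry atomically only on a (rare) conflict.  Uses the GNU __atomic builtin
// for the retry, as in the full function below.
static inline void set_bit_checked(volatile uint8_t *bytes, unsigned idx)
{
    uint8_t mask = uint8_t(1u << (idx & 7));
    bytes[idx >> 3] |= mask;                 // non-atomic RMW, no lock prefix
    _mm_mfence();                            // block StoreLoad reordering
    if (!(bytes[idx >> 3] & mask))           // did another thread's RMW step on our bit?
        __atomic_fetch_or(&bytes[idx >> 3], mask, __ATOMIC_RELAXED);
}
```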
`mfence` is expensive, but the trick that makes this better than just using `lock or [N+rdi], al` is that we can batch operations: e.g. do 32 `or` instructions and then 32 read-backs. It's a tradeoff between `mfence` overhead per operation and an increased chance of false sharing (reading back cache lines that had already been invalidated by another CPU claiming them).

Instead of an actual `mfence` instruction, we can do the last `or` of a group as a `lock or`. This is better for throughput on both AMD and Intel. For example, according to Agner Fog's tables, `mfence` has one-per-33c throughput on Haswell/Skylake, while `lock add` (same performance as `lock or`) has 18c or 19c throughput. Or for Ryzen: ~70c (`mfence`) vs. ~17c (`lock add`).
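For the asker's MSVC toolset, where the GNU builtins used below aren't available, something like `_InterlockedOr8` should give the same lock'd byte `or` / full-barrier effect for the last operation of a batch; a sketch (names and structure are illustrative, not part of this answer's original code):

```cpp
#include <intrin.h>   // MSVC intrinsics

// _InterlockedOr8 compiles to  lock or byte ptr [...], reg  on x86 and acts as a
// full memory barrier, so it can stand in for mfence at the end of a batch.
static inline void set_bit_locked(volatile char *bytes, unsigned idx)
{
    _InterlockedOr8(&bytes[idx >> 3], char(1u << (idx & 7)));
}
```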
If we keep the number of operations per fence very low, the array index (`m[i]/8`) + mask (`1<<(m[i] & 7)`) can be kept in registers for all the operations. This probably isn't worth it; fences are too expensive to do as often as every 6 `or` operations. Using the `bts` and `bt` bit-string instructions would mean we could keep more indices in registers (because no shift result is needed), but probably not worth it because they're slow.
Using vector registers to hold indices might be a good idea, to avoid having to reload them from memory after the barrier. We want the load addresses to be ready as soon as the read-back load uops can execute (because they're waiting for the last store before the barrier to commit to L1D and become globally visible).
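A sketch of what that might look like with AVX2 (untested; the helper is hypothetical):

```cpp
#include <immintrin.h>
#include <cstdint>

// Load one batch of 8 indices from M and pre-compute their byte offsets (idx >> 3)
// in a ymm register, so the read-back loop after the barrier doesn't have to reload
// M[] from memory.  The bit masks (1 << (idx & 7)) could be kept in a second
// register the same way.
static inline __m256i load_byte_offsets(const uint32_t *m)
{
    __m256i idx = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(m));
    return _mm256_srli_epi32(idx, 3);    // per-lane idx >> 3
}
```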
Using single-byte read-modify-writes makes actual conflicts as unlikely as possible: each write of a byte can only step on the 7 neighbouring bits in that same byte. Performance still suffers from false sharing when two threads modify bytes in the same 64B cache line, but at least we avoid having to actually redo as many `or` operations. A 32-bit element size would make some things more efficient (like using `xor eax,eax` / `bts eax, reg` to generate `1<<(m[i] & 31)` with only 2 uops, or 1 for BMI2 `shlx eax, r10d, reg` (where `r10d=1`)).
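For comparison, the 32-bit-element variant of the plain RMW step would be just this (a sketch; `set_bit32` is a made-up name):

```cpp
#include <cstdint>

// 32-bit-element version of the plain store step: the mask is a single shift
// (or one shlx with BMI2).  A conflict can now clobber any of 31 neighbouring
// bits instead of 7, but the read-back check after the barrier still catches it.
static inline void set_bit32(volatile uint32_t *words, unsigned idx)
{
    words[idx >> 5] |= 1u << (idx & 31);   // non-atomic RMW on a 32-bit element
}
```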
Avoid bit-string instructions like `bts [N], eax`: they have worse throughput than doing the indexing and mask calculation for `or [N + rax], dl`. This is the perfect use-case for `bts` (except that we don't care about the old value of the bit in memory, we just want to set it), but its CISC baggage is still too much.
In C, a function might look something like this:

```c
/// UGLY HACKS AHEAD, for testing only.
// #include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

void set_bits( volatile uint8_t * restrict N, const unsigned *restrict M, size_t len)
{
    const size_t batchsize = 32;
    // FIXME: handle the tail; this skips any final partial batch.
    for (size_t i = 0 ; i + batchsize <= len ; i += batchsize ) {
        for (size_t j = 0 ; j < batchsize-1 ; j++ ) {
            unsigned idx = M[i+j];
            unsigned mask = 1U << (idx&7);
            idx >>= 3;
            N[idx] |= mask;               // plain byte RMW, no lock prefix
        }

        // do the last operation of the batch with a lock prefix as a memory barrier.
        // seq_cst RMW is probably a full barrier on non-x86 architectures, too.
        unsigned idx = M[i+batchsize-1];
        unsigned mask = 1U << (idx&7);
        idx >>= 3;
        __atomic_fetch_or(&N[idx], mask, __ATOMIC_SEQ_CST);
        // _mm_mfence();

        // TODO: cache `M[]` in vector registers
        for (size_t j = 0 ; j < batchsize ; j++ ) {
            unsigned idx = M[i+j];
            unsigned mask = 1U << (idx&7);
            idx >>= 3;
            if (! (N[idx] & mask)) {      // read-back: did another thread clobber our bit?
                __atomic_fetch_or(&N[idx], mask, __ATOMIC_RELAXED);
            }
        }
    }
}
```
This compiles to approximately what we want with gcc and clang. The asm (Godbolt) could be more efficient in several ways, but it might be interesting to try this. This is not safe: I just hacked this together in C to get the asm I wanted for this stand-alone function, without inlining into a caller or anything. `__atomic_fetch_or` is not a proper compiler barrier for non-atomic variables the way `asm("":::"memory")` is. (At least the C11 `stdatomic` version isn't.) I should probably have used the legacy `__sync_fetch_and_or`, which is a full barrier for all memory operations.

It uses GNU C atomic builtins to do atomic RMW operations where desired on variables that aren't `atomic_uint8_t`. Running this function from multiple threads at once would be C11 UB, but we only need it to work on x86. I used `volatile` to get the asynchronous-modification-allowed part of `atomic` without forcing `N[idx] |= mask;` to be atomic. The idea is to make sure that the read-back checks don't optimize away.

I use `__atomic_fetch_or` as a memory barrier because I know it will be one on x86. With seq_cst, it probably will be on other ISAs too, but this is all a big hack.