What is the granularity of "masked" stores in AVX512?

Question

Lets say you call _mm512_mask_store_ps, from the point of view of the CPU's write buffer, is it executed as a store of size 64-bytes (with some sort of masking) or is it executed internally as multiple stores of size 4-bytes?

In order to prevent store-to-load forwarding stalls, one must match the granularity (size) of a store to the granularity of subsequent loads to the same memory location. Hopefully the question makes sense, I'm no CPU architecture expert.

user2059893 · Accepted Answer

As Iwillnotexist referenced:

If the mask is not all 1 or all 0, loads that depend on the masked store have to wait until the store data is written to the cache. If the mask is all 1 the data can be forwarded from the masked store to the dependent loads. If the mask is all 0 the loads do not depend on the masked store.

So there's no store-to-load forwarding for masked-stores, except for the case when the mask is all ones (behaves like a regular store), or all zeros (trivial). Load after a masked-store generally waits for data to be sent to cache, so it should be pretty expensive.

What is the granularity of "masked" stores in AVX512?

Tags:

performance

cpu-architecture

assembly

intel

avx512

user2059893

1 Answers

user2059893

Recent Activity

Donate For Us

What is the granularity of "masked" stores in AVX512?

Tags:

performance

cpu-architecture

assembly

intel

avx512

user2059893

1 Answers

user2059893

Related questions

Recent Activity

Donate For Us