Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the granularity of "masked" stores in AVX512?

Lets say you call _mm512_mask_store_ps, from the point of view of the CPU's write buffer, is it executed as a store of size 64-bytes (with some sort of masking) or is it executed internally as multiple stores of size 4-bytes?

In order to prevent store-to-load forwarding stalls, one must match the granularity (size) of a store to the granularity of subsequent loads to the same memory location. Hopefully the question makes sense, I'm no CPU architecture expert.

like image 860
user2059893 Avatar asked Sep 03 '20 20:09

user2059893


1 Answers

As Iwillnotexist referenced:

If the mask is not all 1 or all 0, loads that depend on the masked store have to wait until the store data is written to the cache. If the mask is all 1 the data can be forwarded from the masked store to the dependent loads. If the mask is all 0 the loads do not depend on the masked store.

So there's no store-to-load forwarding for masked-stores, except for the case when the mask is all ones (behaves like a regular store), or all zeros (trivial). Load after a masked-store generally waits for data to be sent to cache, so it should be pretty expensive.

like image 131
user2059893 Avatar answered Sep 27 '22 20:09

user2059893