What's the difference between the following two lines?
__m128 x = _mm_load_ps((float *)ptr);
__m128d y = _mm_load_pd((double *)ptr);
In other words, why are there so many different _mm_load_xyz instructions, instead of a generic __m128 _mm_load(const void *)?
There are different intrinsics because they correspond to different instructions.
There are different load instructions because Intel wants to keep the freedom to design a processor on which double-precision vectors are backed by a different physical register file than single-precision or integer vectors, or are handled by different execution units. Either design could add latency if there were no way to specify which register file or forwarding network the loaded data should go to.
One way to think about it is that the different instructions do the "same thing", but additionally provide a hint to the processor telling it how the data that is being loaded will be used by future instructions. This may help the processor make sure that the data is in the right place to be used as efficiently as possible, or it may be ignored by the processor.
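To illustrate the "same thing" part, here is a minimal sketch in C, assuming an x86 target with SSE2. It loads the same 16 bytes with the single-precision, double-precision, and integer load intrinsics and checks that the register contents are bit-for-bit identical. The function name loads_are_bitwise_identical is made up for this example, and a compiler is free to emit a different load instruction than the intrinsic's name suggests.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE ones */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if all three aligned loads of the same
   16 bytes yield identical register contents, 0 otherwise. */
int loads_are_bitwise_identical(const void *ptr)
{
    /* The aligned load intrinsics require a 16-byte-aligned address. */
    assert(((uintptr_t)ptr & 15u) == 0);

    __m128  s = _mm_load_ps((const float *)ptr);      /* single-precision load */
    __m128d d = _mm_load_pd((const double *)ptr);     /* double-precision load */
    __m128i i = _mm_load_si128((const __m128i *)ptr); /* integer load */

    /* Store each register back to memory and compare the raw bytes. */
    unsigned char bs[16], bd[16], bi[16];
    _mm_storeu_ps((float *)bs, s);
    _mm_storeu_pd((double *)bd, d);
    _mm_storeu_si128((__m128i *)bi, i);

    return memcmp(bs, bd, 16) == 0 && memcmp(bs, bi, 16) == 0;
}
```

The three loads differ only in the type of the result and in the domain hint they carry, not in the bits they move.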
Note that this isn't just a hypothetical. There exist processors on which using an integer vector load (MOVDQA) to load data that is consumed by a floating-point operation takes more time than using a floating-point load for the same data (and vice versa). See the Intel Optimization Manual or Agner Fog's notes for more detail on the subject. Use the load that matches how you will use the data to avoid the risk of such performance hazards in the future.
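As a rough sketch of that advice, assuming SSE2 and 16-byte-aligned pointers, each routine below pairs a load with arithmetic from the same domain. The function names (add4_float, add2_double, add4_int32) are hypothetical and chosen only for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE ones */
#include <stdint.h>

/* Single-precision data consumed by single-precision math: _mm_load_ps.
   All pointers are assumed to be 16-byte aligned. */
void add4_float(float *dst, const float *a, const float *b)
{
    __m128 va = _mm_load_ps(a);
    __m128 vb = _mm_load_ps(b);
    _mm_store_ps(dst, _mm_add_ps(va, vb));
}

/* Double-precision data consumed by double-precision math: _mm_load_pd. */
void add2_double(double *dst, const double *a, const double *b)
{
    __m128d va = _mm_load_pd(a);
    __m128d vb = _mm_load_pd(b);
    _mm_store_pd(dst, _mm_add_pd(va, vb));
}

/* Integer data consumed by integer math: _mm_load_si128. */
void add4_int32(int32_t *dst, const int32_t *a, const int32_t *b)
{
    __m128i va = _mm_load_si128((const __m128i *)a);
    __m128i vb = _mm_load_si128((const __m128i *)b);
    _mm_store_si128((__m128i *)dst, _mm_add_epi32(va, vb));
}
```

When the same bits really do need to be viewed as another type, the cast intrinsics such as _mm_castps_si128 reinterpret a register without converting its contents.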