I am attempting to rewrite a raytracer using Streaming SIMD Extensions. My original raytracer used inline assembly and movups instructions to load data into the xmm registers. I have read that compiler intrinsics are not significantly slower than inline assembly (I suspect I may even gain speed by avoiding unaligned memory accesses) and are much more portable, so I am attempting to migrate my SSE code to use the intrinsics in xmmintrin.h. The primary class affected is vector, which looks something like this:
#include "xmmintrin.h"
union vector {
__m128 simd;
float raw[4];
//some constructors
//a bunch of functions and operators
} __attribute__ ((aligned (16)));
I have read previously that the g++ compiler will automatically allocate structs along memory boundaries equal to that of the size of the largest member variable, but this does not seem to be occurring, and the aligned attribute isn't helping. My research indicates that this is likely because I am allocating a whole bunch of function-local vectors on the stack, and that alignment on the stack is not guaranteed in x86. Is there any way to force this alignment? I should mention that this is running under native x86 Linux on a 32-bit machine, not Cygwin. I intend to implement multithreading in this application further down the line, so declaring the offending vector instances to be static isn't an option. I'm willing to increase the size of my vector data structure, if needed.
IIRC, stack alignment means that variables on the stack are placed at addresses "aligned" to a particular number of bytes. So with a 16-bit stack alignment, each variable on the stack starts at an offset from the current stack pointer that is a multiple of 2 bytes.
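To make the "multiple of N bytes" idea concrete, here is a minimal sketch of the rounding arithmetic involved. The helper name align_up is my own invention for illustration, and it assumes the alignment is a power of two:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: round `offset` up to the next multiple of `alignment`.
// `alignment` is assumed to be a power of two (2, 4, 8, 16, ...).
inline std::size_t align_up(std::size_t offset, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // Adding (alignment - 1) pushes any non-multiple past the next boundary;
    // masking the low bits then snaps the result back onto that boundary.
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

With a 2-byte alignment an offset of 3 becomes 4, while an offset that is already a multiple of the alignment is left unchanged.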
Alignment refers to the arrangement of data in memory, and specifically deals with the issue of accessing data as proper units of information from main memory. First we must conceptualize main memory as a contiguous block of consecutive memory locations. Each location contains a fixed number of bits.
The CPU can operate on an aligned word of memory atomically, meaning that no other instruction can interrupt that operation. This is critical to the correct operation of many lock-free data structures and other concurrency paradigms.
Alignment matters not only for performance, but also for correctness. Some architectures will fail with a processor trap if the data is not aligned correctly, or will access the wrong memory location.
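If you want to see whether a given object actually landed on the boundary you asked for, you can test its address directly. This is only a sketch: is_aligned and local_buffer_is_aligned are names I made up, and it assumes a C++11 compiler that honours alignas for stack variables.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when `p` sits on an `alignment`-byte boundary.
inline bool is_aligned(const void* p, std::uintptr_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// A function-local buffer for which 16-byte alignment is requested explicitly.
inline bool local_buffer_is_aligned() {
    alignas(16) float buffer[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    return is_aligned(buffer, 16);
}
```

Dropping a check like this into a debug build is a cheap way to find out whether your stack variables are really where an aligned load expects them.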
The simplest way is std::aligned_storage, which takes alignment as a second parameter.
If you don't have it yet, you might want to check Boost's version.
Then you can build your union:
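#include <type_traits>
union vector {
  __m128 simd;
  // ::type is the actual storage type; aligned_storage itself is only a trait
  std::aligned_storage<16, 16>::type alignment_only;
};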
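A quick way to convince yourself that the extra member does its job is to ask the compiler for the union's alignment. The sketch below is an assumption-laden stand-in, not the union above: padded_vector and local_instance_is_aligned are hypothetical names, and I left out the __m128 member so that it builds without the SSE headers.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for the union above, without the __m128 member.
union padded_vector {
    float raw[4];                                       // typically 4-byte aligned on its own
    std::aligned_storage<16, 16>::type alignment_only;  // raises the union's alignment to 16
};

static_assert(alignof(padded_vector) == 16, "union should be 16-byte aligned");
static_assert(sizeof(padded_vector) >= 16, "union must hold four floats");

// Runtime check on a function-local instance.
inline bool local_instance_is_aligned() {
    padded_vector v;
    v.raw[0] = 1.0f;
    assert(v.raw[0] == 1.0f);
    return reinterpret_cast<std::uintptr_t>(&v) % 16 == 0;
}
```

Note that the static_assert only proves the type asks for 16-byte alignment. Whether a stack instance really gets it depends on the compiler realigning the stack, which is exactly what the runtime check is for.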
Finally, if it does not work, you can always create your own little class:
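#include <cstdint>

template <typename Type, std::uintptr_t Align> // Align must be a power of 2
class RawStorage
{
public:
  Type* operator->() {
    return reinterpret_cast<Type*>(const_cast<unsigned char*>(aligned()));
  }
  Type const* operator->() const {
    return reinterpret_cast<Type const*>(aligned());
  }
  Type& operator*() { return *(operator->()); }
  Type const& operator*() const { return *(operator->()); }
private:
  // Returns the first address inside data that is a multiple of Align.
  unsigned char const* aligned() const {
    std::uintptr_t const address = reinterpret_cast<std::uintptr_t>(data);
    std::uintptr_t const rounded = (address + Align - 1) & ~(Align - 1);
    return data + (rounded - address);
  }
  // Align - 1 spare bytes leave room to slide forward to the boundary.
  unsigned char data[sizeof(Type) + Align - 1];
};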
It will allocate a bit more storage than necessary, but this way alignment is guaranteed.
int main(int argc, char* argv[])
{
RawStorage<__m128, 16> simd;
*simd = /* ... */;
return 0;
}
With luck, the compiler might be able to optimize away the pointer adjustment if it can detect that the alignment is already right.
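If you have C++11, std::align from <memory> can do the pointer rounding for you instead of hand-written masking. This is a different technique from the class above, shown only as a sketch; aligned_floats is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical helper: find a 16-byte-aligned slot for four floats inside an
// over-sized raw buffer, using std::align rather than manual bit masking.
inline float* aligned_floats(unsigned char* buffer, std::size_t buffer_size) {
    assert(buffer != nullptr);
    void* cursor = buffer;
    std::size_t space = buffer_size;
    // std::align returns the adjusted pointer, or nullptr when the buffer
    // cannot hold `size` bytes at the requested alignment.
    void* slot = std::align(16, 4 * sizeof(float), cursor, space);
    return static_cast<float*>(slot);
}
```

As with RawStorage, the buffer has to be over-allocated by up to 15 bytes so that an aligned slot is guaranteed to fit.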
A few weeks ago, I rewrote an old ray tracing assignment from my university days, updating it to run on 64-bit Linux and to make use of the SIMD instructions. (The old version incidentally ran under DOS on a 486, to give you an idea of when I last did anything with it).
There very well may be better ways of doing it, but here is what I did ...
typedef float v4f_t __attribute__((vector_size (16)));
class Vector {
...
union {
v4f_t simd;
float f[4];
} __attribute__ ((aligned (16)));
...
};
Disassembling my compiled binary showed that it was indeed making use of the movaps instruction.
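For reference, a filled-in version of that idea might look roughly like the sketch below. The class name Vec4, its constructor and its operators are my own placeholders and not the elided parts of the class above. It relies on the GCC/Clang vector_size extension and on GCC's documented behaviour of allowing reads from a union member other than the one last written.

```cpp
#include <cassert>
#include <cstdint>

// GCC/Clang vector extension: four packed floats, 16 bytes in total.
typedef float v4f_t __attribute__((vector_size(16)));

class Vec4 { // hypothetical minimal class, for illustration only
public:
    Vec4(float x, float y, float z, float w) {
        f[0] = x; f[1] = y; f[2] = z; f[3] = w;
    }
    // Element-wise addition expressed on the vector type; the compiler is
    // free to lower this to a single SSE add on x86.
    Vec4 operator+(const Vec4& other) const {
        Vec4 result(0.0f, 0.0f, 0.0f, 0.0f);
        result.simd = simd + other.simd;
        return result;
    }
    float operator[](int i) const {
        assert(i >= 0 && i < 4);
        return f[i];
    }
private:
    union __attribute__((aligned(16))) {
        v4f_t simd;
        float f[4];
    };
};
```

Whether the generated code uses movaps still depends on the compiler and flags, so it is worth disassembling your own build as well.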
Hope this helps.