Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can GCC be coerced to generate efficient constructors for memory-aligned objects?

I'm optimizing a constructor that is called in one of our app's innermost loops. The class in question is about 100 bytes wide, consists of a bunch of ints, floats, bools, and trivial structs, and should be trivially copyable (it has a nontrivial default constructor, but no destructor or virtual functions). It is constructed often enough that every nanosecond of time spent in this ctor works out to around $6,000 of extra server hardware we need to buy.

However, I find that GCC is not emitting very efficient code for this constructor (even with -O3 -march etc set). GCC's implementation of the constructor, filling out default values via an initializer list, takes about 34ns to run. If instead of this default constructor I use a hand-written function that writes directly to the object's memory space with a variety of SIMD intrinsics and pointer math, construction takes about 8ns.

Can I get GCC to emit an efficient constructor for such objects when I __attribute__ them to be memory-aligned on SIMD boundaries? Or must I resort to old-school techniques like writing my own memory initializers in assembly?

This object is only ever constructed as a local on the stack, so any new/malloc overhead doesn't apply.

Context:

This class is used by constructing it on the stack as a local variable, selectively writing a few fields with non-default values, and then passing it (by reference) to a function, which passes its reference to another and so on.

struct Trivial {   float x,y,z;   Trivial () : x(0), y(0), z(0) {}; };  struct Frobozz {    int na,nb,nc,nd;    bool ba,bb,bc;    char ca,cb,cc;    float fa,fb;    Trivial va, vb; // in the real class there's several different kinds of these    // and so on    Frobozz() : na(0), nb(1), nc(-1), nd(0),                ba(false), bb(true), bc(false),                ca('a'), cb('b'), cc('c'),                fa(-1), fb(1.0) // etc     {} } __attribute__(( aligned(16) ));  // a pointer to a func that takes the struct by reference typedef int (*FrobozzSink_t)( Frobozz& );  // example of how a function might construct one of the param objects and send it // to a sink. Imagine this is one of thousands of event sources: int OversimplifiedExample( int a, float b ) {    Frobozz params;     params.na = a; params.fb = b; // other fields use their default values    FrobozzSink_t funcptr = AssumeAConstantTimeOperationHere();    return (*funcptr)(params); } 

The optimal constructor here would work by copying from a static "template" instance into the freshly constructed instance, ideally using SIMD operators to work 16 bytes at a time. Instead GCC does exactly the wrong thing for OversimplifiedExample() — a series of immediate mov ops to fill out the struct byte-by-byte.

// from objdump -dS int OversimplifiedExample( int a, float b ) {      a42:55                   push   %ebp      a43:89 e5                mov    %esp,%ebp      a45:53                   push   %ebx      a46:e8 00 00 00 00       call   a4b <_Z21OversimplifiedExampleif+0xb>      a4b:5b                   pop    %ebx      a4c:81 c3 03 00 00 00    add    $0x3,%ebx      a52:83 ec 54             sub    $0x54,%esp      // calling the 'Trivial()' constructors which move zero, word by word...      a55:89 45 e0             mov    %eax,-0x20(%ebp)      a58:89 45 e4             mov    %eax,-0x1c(%ebp)      a5b:89 45 e8             mov    %eax,-0x18(%ebp)      a5e:89 45 ec             mov    %eax,-0x14(%ebp)      a61:89 45 f0             mov    %eax,-0x10(%ebp)      a64:89 45 f4             mov    %eax,-0xc(%ebp)      // filling out na/nb/nc/nd..      a67:c7 45 c4 01 00 00 00 movl   $0x1,-0x3c(%ebp)      a71:c7 45 c8 ff ff ff ff movl   $0xffffffff,-0x38(%ebp)      a78:89 45 c0             mov    %eax,-0x40(%ebp)      a7b:c7 45 cc 00 00 00 00 movl   $0x0,-0x34(%ebp)      a82:8b 45 0c             mov    0xc(%ebp),%eax      // doing the bools and chars by moving one immediate byte at a time!      a85:c6 45 d0 00          movb   $0x0,-0x30(%ebp)      a89:c6 45 d1 01          movb   $0x1,-0x2f(%ebp)      a8d:c6 45 d2 00          movb   $0x0,-0x2e(%ebp)      a91:c6 45 d3 61          movb   $0x61,-0x2d(%ebp)      a95:c6 45 d4 62          movb   $0x62,-0x2c(%ebp)      a99:c6 45 d5 63          movb   $0x63,-0x2b(%ebp)      // now the floats...      a9d:c7 45 d8 00 00 80 bf movl   $0xbf800000,-0x28(%ebp)      aa4:89 45 dc             mov    %eax,-0x24(%ebp)      // FrobozzSink_t funcptr = GetFrobozz();      aa7:e8 fc ff ff ff       call   aa8 <_Z21OversimplifiedExampleif+0x68>      // return (*funcptr)(params);      aac:8d 55 c0             lea    -0x40(%ebp),%edx      aaf:89 14 24             mov    %edx,(%esp)      ab2:ff d0                call   *%eax      ab4:83 c4 54             add    $0x54,%esp      ab7:5b                   pop    %ebx      ab8:c9                   leave       ab9:c3                   ret    } 

I tried to encourage GCC to construct a single 'default template' of this object, and then bulk-copy it in the default constructor, by doing a bit of trickery with a hidden 'dummy' constructor that made the base exemplar and then having the default just copy it:

struct Frobozz {      int na,nb,nc,nd;      bool ba,bb,bc;      char ca,cb,cc;      float fa,fb;      Trivial va, vb;      inline Frobozz(); private:      // and so on      inline Frobozz( int dummy ) : na(0), /* etc etc */     {} } __attribute__( ( aligned( 16 ) ) );  Frobozz::Frobozz( ) {      const static Frobozz DefaultExemplar( 69105 );      // analogous to copy-on-write idiom      *this = DefaultExemplar;      // or:      // memcpy( this, &DefaultExemplar, sizeof(Frobozz) ); } 

But this generated even slower code than the basic default with initializer list, due to some redundant stack copying.

Finally I resorted to writing an inlined free function to do the *this = DefaultExemplar step, using compiler intrinsics and assumptions about memory alignment to issue pipelined MOVDQA SSE2 opcodes that copy the struct efficiently. This got me the performance I need, but it's icky. I thought my days of writing initializers in assembly were behind me, and I'd really rather just have GCC's optimizer emit the right code in the first place.

Is there some way I can get GCC to generate optimal code for my constructor, some compiler setting or additional __attribute__ I've missed?

This is GCC 4.4 running on Ubuntu. Compiler flags include -m32 -march=core2 -O3 -fno-strict-aliasing -fPIC (among others). Portability is not a consideration, and I'm thoroughly willing to sacrifice standards-compliance for performance here.

Timings were performed by directly reading the time stamp counter with rdtsc, eg measuring a loop of N OversimplifiedExample() calls between samples with due attention to timer resolution and cache and statistical significance and so on.

I've also optimized this by reducing the number of call sites as much as possible, of course, but I'd still like to know how to generally get better ctors out of GCC.

like image 767
Crashworks Avatar asked Jan 17 '12 12:01

Crashworks


People also ask

Does a constructor create an object in memory?

A constructor does not allocate memory for the class object its this pointer refers to, but may allocate storage for more objects than its class object refers to. If memory allocation is required for objects, constructors can explicitly call the new operator.

Are constructors automatically invoked?

Constructor in C++ is a special method that is invoked automatically at the time of object creation. It is used to initialize the data members of new objects generally. The constructor in C++ has the same name as the class or structure.

Do default constructors have parameters?

A default constructor is a constructor that either has no parameters, or if it has parameters, all the parameters have default values. If no user-defined constructor exists for a class A and one is needed, the compiler implicitly declares a default parameterless constructor A::A() .


1 Answers

Here's how I would do it. Don't declare any constructor; instead, declare a fixed Frobozz that contains default values:

const Frobozz DefaultFrobozz =   {   0, 1, -1, 0,        // int na,nb,nc,nd;   false, true, false, // bool ba,bb,bc;   'a', 'b', 'c',      // char ca,cb,cc;   -1, 1.0             // float fa,fb;   } ; 

Then in OversimplifiedExample:

Frobozz params (DefaultFrobozz) ; 

With gcc -O3 (version 4.5.2), the initialisation of params reduces to:

leal    -72(%ebp), %edi movl    $_DefaultFrobozz, %esi movl    $16, %ecx rep movsl 

which is about as good as it gets in a 32-bit environment.

Warning: I tried this with the 64-bit g++ version 4.7.0 20110827 (experimental), and it generated an explicit sequence of 64-bit copies instead of a block move. The processor doesn't allow rep movsq, but I would expect rep movsl to be faster than a sequence of 64-bit loads and stores. Perhaps not. (But the -Os switch -- optimise for space -- does use a rep movsl instruction.) Anyway, try this and let us know what happens.

Edited to add: I was wrong about the processor not allowing rep movsq. Intel's documentation says "The MOVS, MOVSB, MOVSW, and MOVSD instructions can be preceded by the REP prefix", but it seems that this is just a documentation glitch. In any case, if I make Frobozz big enough, then the 64-bit compiler generates rep movsq instructions; so it probably knows what it's doing.

like image 167
TonyK Avatar answered Oct 16 '22 15:10

TonyK