
Coding Practices which enable the compiler/optimizer to make a faster program



Here's a coding practice to help the compiler create fast code—any language, any platform, any compiler, any problem:

Do not use any clever tricks which force, or even encourage, the compiler to lay variables out in memory (including cache and registers) as you think best. First write a program which is correct and maintainable.

Next, profile your code.

Then, and only then, you might want to start investigating the effects of telling the compiler how to use memory. Make 1 change at a time and measure its impact.

Expect to be disappointed and to have to work very hard indeed for small performance improvements. Modern compilers for mature languages such as Fortran and C are very, very good. If you read an account of a 'trick' to get better performance out of code, bear in mind that the compiler writers have also read about it and, if it is worth doing, probably implemented it. They probably wrote what you read in the first place.


Write to local variables and not output arguments! This can be a huge help for getting around aliasing slowdowns. For example, if your code looks like

void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    for (int i=0; i<numFoo; i++)
    {
         barOut.munge(foo1, foo2[i]);
    }
}

the compiler doesn't know that foo1 != barOut, and thus has to reload foo1 each time through the loop. It also can't read foo2[i] until the write to barOut is finished. You could start messing around with restrict-qualified pointers, but it's just as effective (and much clearer) to do this:

void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    Foo barTemp = barOut;
    for (int i=0; i<numFoo; i++)
    {
         barTemp.munge(foo1, foo2[i]);
    }
    barOut = barTemp;
}

It sounds silly, but the compiler can be much smarter dealing with the local variable, since it can't possibly overlap in memory with any of the arguments. This can help you avoid the dreaded load-hit-store (mentioned by Francis Boivin in this thread).
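
For completeness, here is a sketch of the restrict-qualified alternative mentioned above. It is not from the original answer; __restrict is a compiler-specific extension in C++ (spelled restrict in C99), and support for restricted references varies, so treat this as an assumption to verify on your toolchain:

void DoSomethingRestrict(const Foo& foo1, const Foo* __restrict foo2,
                         int numFoo, Foo& __restrict barOut)
{
    // The __restrict qualifiers promise the compiler that barOut does not
    // alias foo1 or foo2, so it may keep foo1 cached across iterations.
    for (int i = 0; i < numFoo; i++)
    {
        barOut.munge(foo1, foo2[i]);
    }
}

The local-copy version achieves the same effect without relying on a language extension, which is why it is the recommended approach here.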


The order in which you traverse memory can have a profound impact on performance, and compilers aren't really good at figuring that out and fixing it. You have to be mindful of cache locality when you write code if you care about performance. For example, two-dimensional arrays in C are allocated in row-major format. Traversing an array in column-major order tends to cause more cache misses and makes your program memory bound rather than processor bound:

#define N 1000000
int matrix[N][N] = { ... };

//awesomely fast
long sum = 0;
for(int i = 0; i < N; i++){
  for(int j = 0; j < N; j++){
    sum += matrix[i][j];
  }
}

//painfully slow
long sum = 0;
for(int i = 0; i < N; i++){
  for(int j = 0; j < N; j++){
    sum += matrix[j][i];
  }
}

Generic Optimizations

Here are some of my favorite optimizations. I have actually reduced execution times and program sizes by using them.

Declare small functions as inline or macros

Each call to a function (or method) incurs overhead, such as pushing variables onto the stack. Some functions incur overhead on return as well. A function or method whose body contains fewer statements than the combined call-and-return overhead is a good candidate for inlining, whether as a #define macro or an inline function. (Yes, I know inline is only a suggestion, but in this case I consider it a reminder to the compiler.)
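
A minimal sketch (the names are illustrative, not from the original answer):

// Macro version: expanded textually, so no call overhead, but each argument
// may be evaluated more than once.
#define SQUARE_MACRO(x) ((x) * (x))

// Inline function version: type-checked, argument evaluated once, and the
// compiler is free to expand it in place.
inline int square(int x)
{
    return x * x;
}

The one-line body is far smaller than the cost of a call and return, which is exactly the situation described above.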

Remove dead and redundant code

If the code isn't used or does not contribute to the program's result, get rid of it.

Simplify design of algorithms

I once removed a lot of assembly code and execution time from a program by writing down the algebraic equation it was calculating and then simplified the algebraic expression. The implementation of the simplified algebraic expression took up less room and time than the original function.
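
As an illustration (this expression is made up, not the one from that program), simplifying the algebra directly shrinks the work per call:

// Original form: three multiplies and two adds per call.
double evaluate_original(double a, double b, double c)
{
    return a * b + a * c + a * a;
}

// Algebraically simplified to a * (a + b + c): one multiply and two adds.
double evaluate_simplified(double a, double b, double c)
{
    return a * (a + b + c);
}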

Loop Unrolling

Each loop has an overhead of incrementing and termination checking. To get an estimate of the performance factor, count the number of instructions in the overhead (minimum 3: increment, check, goto start of loop) and divide by the number of statements inside the loop. The lower the number the better.

An example of loop unrolling. Before:

unsigned int sum = 0;
for (size_t i = 0; i < BYTES_TO_CHECKSUM; ++i)
{
    sum += *buffer++;
}

After unrolling:

unsigned int sum = 0;
size_t i = 0;
const size_t STATEMENTS_PER_LOOP = 8;
for (i = 0; i + STATEMENTS_PER_LOOP <= BYTES_TO_CHECKSUM; i += STATEMENTS_PER_LOOP)
{
    sum += *buffer++; // 1
    sum += *buffer++; // 2
    sum += *buffer++; // 3
    sum += *buffer++; // 4
    sum += *buffer++; // 5
    sum += *buffer++; // 6
    sum += *buffer++; // 7
    sum += *buffer++; // 8
}
// Handle the remainder:
for (; i < BYTES_TO_CHECKSUM; ++i)
{
    sum += *buffer++;
}

A secondary benefit is gained as well: more statements are executed before the processor has to reload the instruction cache.

I've had amazing results when I unrolled a loop to 32 statements. This was one of the bottlenecks since the program had to calculate a checksum on a 2GB file. This optimization combined with block reading improved performance from 1 hour to 5 minutes. Loop unrolling provided excellent performance in assembly language too, my memcpy was a lot faster than the compiler's memcpy. -- T.M.

Reduction of if statements

Processors hate branches, or jumps, since they force the processor to reload its queue of instructions.
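
As a sketch of the idea (this example is illustrative and not from the original answer; a modern compiler may already emit a conditional move for the plain if):

// Branching version: a mispredicted jump stalls the pipeline.
int clamp_branch(int value, int limit)
{
    if (value > limit)
        value = limit;
    return value;
}

// Branch-free version: the comparison result (0 or 1) selects the answer arithmetically.
int clamp_branchless(int value, int limit)
{
    int take_limit = (value > limit);   // 1 when over the limit, 0 otherwise
    return take_limit * limit + (1 - take_limit) * value;
}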

Boolean Arithmetic

Convert if statements into boolean assignments. Some processors can conditionally execute instructions without branching:

bool status = true;
status = status && /* first test */;
status = status && /* second test */;

The short circuiting of the Logical AND operator (&&) prevents execution of the tests if the status is false.

Example:

struct Reader_Interface
{
  virtual bool  write(unsigned int value) = 0;
};

struct Rectangle
{
  unsigned int origin_x;
  unsigned int origin_y;
  unsigned int height;
  unsigned int width;

  bool  write(Reader_Interface * p_reader)
  {
    bool status = false;
    if (p_reader)
    {
       status = p_reader->write(origin_x);
       status = status && p_reader->write(origin_y);
       status = status && p_reader->write(height);
       status = status && p_reader->write(width);
    }
    return status;
  }
};

Factor Variable Allocation outside of loops

If a variable is created on the fly inside a loop, move the creation / allocation to before the loop. In most instances, the variable doesn't need to be allocated during each iteration.
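
A sketch with made-up names, assuming a C++ container whose constructor may allocate:

#include <string>
#include <vector>

void label_words_slow(const std::vector<std::string>& words)
{
    for (size_t i = 0; i < words.size(); ++i)
    {
        std::string line("word: ");    // constructed (and possibly heap-allocated) every iteration
        line += words[i];
        // ... use line ...
    }
}

void label_words_fast(const std::vector<std::string>& words)
{
    std::string line;                  // constructed once; its capacity is reused
    for (size_t i = 0; i < words.size(); ++i)
    {
        line = "word: ";
        line += words[i];
        // ... use line ...
    }
}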

Factor constant expressions outside of loops

If a calculation or variable value does not depend on the loop index, move it outside (before) the loop.
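
A sketch (illustrative names) of hoisting a loop-invariant calculation:

#include <cmath>

// Before: sqrt(factor) / n does not depend on i, yet it is recomputed every iteration.
void scale_slow(double* data, int n, double factor)
{
    for (int i = 0; i < n; ++i)
        data[i] *= std::sqrt(factor) / n;
}

// After: the invariant part is computed once, before the loop.
void scale_fast(double* data, int n, double factor)
{
    const double scale = std::sqrt(factor) / n;
    for (int i = 0; i < n; ++i)
        data[i] *= scale;
}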

I/O in blocks

Read and write data in large chunks (blocks). The bigger the better. For example, reading one octet at a time is less efficient than reading 1024 octets with one read.
Example:

static const char  Menu_Text[] = "\n"
    "1) Print\n"
    "2) Insert new customer\n"
    "3) Destroy\n"
    "4) Launch Nasal Demons\n"
    "Enter selection:  ";
static const size_t Menu_Text_Length = sizeof(Menu_Text) - sizeof('\0');
//...
std::cout.write(Menu_Text, Menu_Text_Length);

The efficiency of this technique can be visually demonstrated. :-)

Don't use printf family for constant data

Constant data can be output using a block write. Formatted write will waste time scanning the text for formatting characters or processing formatting commands. See above code example.

Format to memory, then write

Format to a char array using multiple sprintf, then use fwrite. This also allows the data layout to be broken up into "constant sections" and variable sections. Think of mail-merge.
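
A sketch of the idea (the field names are made up; snprintf is used here instead of sprintf for bounds checking):

#include <cstdio>

void write_record(std::FILE* out, const char* name, int quantity, double price)
{
    char buffer[256];
    // Format the whole record into memory first...
    int length = std::snprintf(buffer, sizeof(buffer),
                               "name=%s quantity=%d price=%.2f\n",
                               name, quantity, price);
    if (length > 0 && length < static_cast<int>(sizeof(buffer)))
    {
        // ...then emit it with a single block write.
        std::fwrite(buffer, 1, static_cast<std::size_t>(length), out);
    }
}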

Declare constant text (string literals) as static const

When variables are declared without the static, some compilers may allocate space on the stack and copy the data from ROM. These are two unnecessary operations. This can be fixed by using the static prefix.
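
A small sketch of the difference (the text is made up):

#include <cstdio>

void print_banner()
{
    // Without static, some compilers build this array on the stack and copy
    // the characters into it on every call.
    const char banner_copy[] = "Report generator\n";

    // With static const, the data stays in read-only storage and is not copied.
    static const char banner_rom[] = "Report generator\n";

    std::fwrite(banner_rom, 1, sizeof(banner_rom) - 1, stdout);
    (void) banner_copy;   // only here for comparison
}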

Lastly, Code like the compiler would

Sometimes the compiler can optimize several small statements better than one complicated version, and writing code with the optimizer in mind helps as well. If I want the compiler to use special block transfer instructions, I will write code that looks like it should use those instructions.
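
For instance (an illustrative sketch, not from the original answer), a plain element-by-element copy loop is a pattern many optimizers recognize and lower to a block move or a call to memcpy:

#include <cstddef>

// Simple enough for the optimizer to spot as a block transfer.
void copy_bytes(unsigned char* dst, const unsigned char* src, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        dst[i] = src[i];
}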