Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Profile C Execution

So, just for fun, and out of curiosity, I wanted to see what executes faster when doing an even-odd check, modulus or bitwise comparisons.

So, I whipped up the following, but I'm not sure that it's behaving correctly, as the difference is so small. I read somewhere online that bitwise should be an order of magnitude faster than modulus checking.

Is it possible that it's getting optimized away? I've just started tinkering with assembly, otherwise I'd attempt to dissect the executable a bit.

EDIT 3: Here is a working test, thanks in a large way to @phonetagger:

#include <stdio.h>
#include <time.h>
#include <stdint.h>

// to reset the global
static const int SEED = 0x2A;

// 5B iterations, each
static const int64_t LOOPS = 5000000000;

int64_t globalVar;

// gotta call something
int64_t doSomething( int64_t input )
{
  return 1 + input;
}

int main(int argc, char *argv[]) 
{
  globalVar = SEED;

  // mod  
  clock_t startMod = clock();

  for( int64_t i=0; i<LOOPS; ++i )
  {    
    if( ( i % globalVar ) == 0 )
    {
      globalVar = doSomething(globalVar);      
    }    
  }

  clock_t endMod = clock();

  double modTime = (double)(endMod - startMod) / CLOCKS_PER_SEC;

  globalVar = SEED;

  // bit
  clock_t startBit = clock();

  for( int64_t j=0; j<LOOPS; ++j )
  {
    if( ( j & globalVar ) == 0 )
    {
      globalVar = doSomething(globalVar);
    }       
  }

  clock_t endBit = clock();

  double bitTime = (double)(endBit - startBit) / CLOCKS_PER_SEC;

  printf("Mod: %lf\n", modTime);
  printf("Bit: %lf\n", bitTime);  
  printf("Dif: %lf\n", ( modTime > bitTime ? modTime-bitTime : bitTime-modTime ));
}

5 billion iterations of each loop, with a global removing compiler optimization yields the following:

Mod: 93.099101
Bit: 16.701401
Dif: 76.397700
like image 986
Josh Avatar asked Jul 19 '12 16:07

Josh


People also ask

What is execution profiling?

The Execution Profiler records timing and execution statistics about instructions for the complete program code. To view the values in the Editor or Disassembly Window, use Show Time or Show Calls from the menu Debug — Execution Profiling.

What is a profiler C?

The C Profiler tool enables the collection and display of execution profile data on C software source code bases of arbitrary size. It is a member of SD's family of Profiler tools.

What does profiling mean in C++?

In software engineering, profiling ("program profiling", "software profiling") is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls.

What is C# profiling?

A profiler is a tool that monitors the execution of another application. A common language runtime (CLR) profiler is a dynamic link library (DLL) that consists of functions that receive messages from, and send messages to, the CLR by using the profiling API. The profiler DLL is loaded by the CLR at run time.


2 Answers

gcc foo.c -std=c99 -S -O0 (note, I specifically did -O0) for x86 gave me the same assembly for both loops. Operator strength reduction meant that both ifs used an andl to get the job done (which is faster than a modulo on Intel machines):

First Loop:

.L6:
        movl    72(%esp), %eax
        andl    $1, %eax
        testl   %eax, %eax
        jne     .L5
        call    doNothing
.L5:
        addl    $1, 72(%esp)
.L4:
        movl    LOOPS, %eax
        cmpl    %eax, 72(%esp)
        jl      .L6

Second Loop:

.L9:
        movl    76(%esp), %eax
        andl    $1, %eax
        testl   %eax, %eax
        jne     .L8
        call    doNothing
.L8:
        addl    $1, 76(%esp)
.L7:
        movl    LOOPS, %eax
        cmpl    %eax, 76(%esp)
        jl      .L9

The miniscule difference you see is probably because of the resolution/inaccuracy of clock.

like image 157
ArjunShankar Avatar answered Sep 17 '22 23:09

ArjunShankar


Most compilers will compile both of the following to EXACTLY the same machine instruction(s):

if( ( i % 2 ) == 0 )

if( ( i & 1 ) == 0 )

...even without ANY "optimization" turned on. The reason is that you are MOD-ing and AND-ing with constant values, and a %2 operation is, as any compiler writer should know, functionally equivalent to an &1 operation. In fact, MOD by any power-of-2 has an equivalent AND operation. If you really want to test the difference, you'll need to make the right-hand-side of both operations be variable, and to be absolutely sure the compiler's cleverness isn't thwarting your efforts, you'll need to bury the variables' initializations somewhere that the compiler can't tell at that point what its runtime value will be; i.e. you'll need to pass the values into a GLOBALLY-DECLARED (i.e. not 'static') test function as parameters, in which case the compiler can't trace back to their definition & substitute the variables with constants, because theoretically any external caller could pass any values in for those parameters. Alternatively, you could leave the code in main() and define the variables globally, in which case the compiler can't substitute them with constants because it can't know for sure that another function may have altered the value of the global variables.

Incidentally, this same issue exists for divide operations.... Divisions by constant powers-of-two can be substituted with an equivalent right-shift (>>) operation. The same trick works for multiplication (<<), but the benefits are less (or nonexistant) for multiplications. True division operations just take a long time in hardware, though significant improvements have been made in most modern processors vs. even 15 years ago, division operations still take maybe 80 clock cycles, while a >> operation takes only a single cycle. You're not going to see an "order of magnitude" improvement using bitwise tricks on modern processors, but most compilers will still use those tricks because there is still some noticeable improvement.

EDIT: On some embedded processors (and, unbelievable though it was, the original Sparc desktop/workstation processor versions before v8), there isn't even a divide instruction at all. All true divide & mod operations on such processors must be performed entirely in software, which can be a monstrously expensive operation. In that sort of environment, you surely would see an order of magnitude difference.

like image 31
phonetagger Avatar answered Sep 20 '22 23:09

phonetagger