Lookup Table vs if-else

Today I read code that uses a lookup table instead of an if-else to clip the sum of two uint8 values. The table maps i → i for i in {0…255}, and i → 255 for i in {256…510}. I wondered how big the gain of this might be, and tried to find out using gprof,

g++ -std=c++0x -pg perfLookup.cpp -O2 -o perfLookup && ./perfLookup && gprof perfLookup |less

with the code attached below. Without the -O2 flag, gprof says that lookup() takes about 45% and ifelse() about 48% of the execution time. With -O2, though, it is 56% for lookup() and 43% for ifelse(). But is this benchmark really valid? Perhaps much of the code is optimized away, since dst is never read?

#include <iostream>
#include <cstdint>
#include <cstdlib>   // rand()
#include <vector>

void lookup(std::vector<uint8_t> src, int repeat) {
  uint8_t lookup[511];  // sums of two uint8_t values range from 0 to 510
  for (int i = 0; i < 256; i++) {
    lookup[i] = i;
  }
  for (int i = 256; i < 511; i++) {  // i < 512 would write one past the end
    lookup[i] = 255;
  }

  std::vector<uint8_t> dst(src.size());
  for (int i = 0; i < repeat; i++) {
    for (size_t k = 0; k < src.size(); k++) {  // k, to avoid shadowing i
      dst[k] = lookup[src[k]];
    }
  }
}

void ifelse(std::vector<uint8_t> src, int repeat) {
  std::vector<uint8_t> dst(src.size());
  for (int i = 0; i < repeat; i++) {
    for (size_t k = 0; k < src.size(); k++) {  // k, to avoid shadowing i
      dst[k] = (src[k] > 255) ? 255 : src[k];
    }
  }
}

int main()
{
  int n = 10000;
  std::vector<uint8_t> src(n);
  for (int i = 0; i < src.size(); i++) {
    src[i] = rand() % 510;
  }

  lookup(src, 10000);
  ifelse(src, 10000);
}

Updated code:

#include <iostream>
#include <cstdint>
#include <cstdlib>   // rand(), malloc(), free()
#include <cstring>
#include <vector>
#include <algorithm>

// g++ -std=c++0x -pg perfLookup.cpp  -O2 -o perfLookup && ./perfLookup && gprof perfLookup |less

std::vector<uint16_t> lookup(std::vector<uint16_t> src, int repeat) {
  uint16_t lookup[511];
  for (int i = 0; i < 256; i++) {
    lookup[i] = i;
  }
  for (int i = 256; i < 511; i++) {
    lookup[i] = 255;
  }

  std::vector<uint16_t> dst(src.size());
  for (int i = 0; i < repeat; i++) {
    for (int k = 0; k < src.size(); k++) {
      dst[k] = lookup[src[k]]; 
    }
  }

  return dst;

}

std::vector<uint16_t> ifelse(std::vector<uint16_t> src, int repeat) {
  std::vector<uint16_t> dst(src.size());
  for (int i = 0; i < repeat; i++) {
    for (int k = 0; k < src.size(); k++) {
      dst[k] = (src[k] > 255) ? 255 : src[k];
    }
  }
  return dst;
}

std::vector<uint16_t> copyv(std::vector<uint16_t> src, int repeat) {
  std::vector<uint16_t> dst(src.size());
  for (int i = 0; i < repeat; i++) {
    dst = src;
    for (int k = 0; k < src.size(); k++) {
      if (dst[k] > 255) {
        dst[k] = 255;
      }
    }
  }
  return dst;
}

std::vector<uint16_t> copyC(std::vector<uint16_t> src, int repeat)
{
  uint16_t* dst = (uint16_t *) malloc(sizeof(uint16_t) * src.size()); // Alloc array for dst

  for (int i = 0; i < repeat; i++) {
    std::memcpy(dst, &src[0], sizeof(uint16_t) * src.size()); // copy src into array

    for (int k = 0; k < src.size(); k++) {
      if ((dst[k] & 0xFF00) != 0)
        dst[k] = 0x00FF;
    }
  }

  free(dst); 
  return std::vector<uint16_t>(); 
}

int main()
{
  int n = 10000;
  std::vector<uint16_t> src(n);
  for (int i = 0; i < src.size(); i++) {
    src[i] = rand() % 510;
  }
  std::vector<uint16_t> dst;
  dst = lookup(src, 10000);
  dst = ifelse(src, 10000);
  dst = copyv(src,   10000);
}
asked Jan 24 '11 by Simon A. Eugster

2 Answers

Well, since src is declared as std::vector<uint8_t>, src[i] is never larger than 255, which is the highest possible value for an 8-bit unsigned integer.

Therefore, my guess is that the compiler optimizes the check away. What remains is just the boilerplate loop, so the benchmark is meaningless.

If the check weren't meaningless (e.g. checking against 64 rather than 255), the result of the 'optimization' would presumably be highly machine-dependent. Branch prediction may (depending on the input data) do a good job of reducing the cost of the branch. The lookup table, on the other hand, requires (again depending on the input data) random memory accesses and pollutes the cache ...

answered Oct 20 '22 by Alexander Gessler


Apart from what Alexander has already said:

Lookup tables can improve performance drastically. However, this is offset by the time it takes to create the lookup table in the first place. Usually you would benchmark this separately.

Another thing that has to be kept in mind is that the lookup table requires space in the cache and may therefore lead to cache misses if it’s big. If there are enough cache misses, the if method will be faster than the lookup table.

Finally, gprof is very good at identifying bottlenecks, but I wouldn't use it for benchmarks. Use a timing function instead. gprof uses sampling, which can strictly speaking be mapped to time consumed, but is less precise here.

answered Oct 20 '22 by Konrad Rudolph