Today I read code that uses a lookup table instead of an if-else to clip the sum of two uint8 values. The table maps i to i for i in {0...255}, and to 255 for i in {256...511}. I wondered how big the gain of this might be, and tried to find out using gprof,
g++ -std=c++0x -pg perfLookup.cpp -O2 -o perfLookup && ./perfLookup && gprof perfLookup |less
with the code attached below. Without the -O2 flag, gprof says that lookup() takes about 45% and ifelse() about 48% of the execution time. With -O2, though, it is 56% for lookup() and 43% for ifelse(). But is this benchmark really correct? Perhaps much of the code is optimized away, since dst is never read?
#include <iostream>
#include <cstdint>
#include <cstdlib> // rand()
#include <vector>

void lookup(std::vector<uint8_t> src, int repeat) {
    uint8_t lookup[512];
    for (int i = 0; i < 256; i++) {
        lookup[i] = i;
    }
    for (int i = 256; i < 512; i++) {
        lookup[i] = 255;
    }
    std::vector<uint8_t> dst(src.size());
    for (int i = 0; i < repeat; i++) {
        for (size_t k = 0; k < src.size(); k++) {
            dst[k] = lookup[src[k]];
        }
    }
}

void ifelse(std::vector<uint8_t> src, int repeat) {
    std::vector<uint8_t> dst(src.size());
    for (int i = 0; i < repeat; i++) {
        for (size_t k = 0; k < src.size(); k++) {
            dst[k] = (src[k] > 255) ? 255 : src[k];
        }
    }
}

int main()
{
    int n = 10000;
    std::vector<uint8_t> src(n);
    for (size_t i = 0; i < src.size(); i++) {
        src[i] = rand() % 510;
    }
    lookup(src, 10000);
    ifelse(src, 10000);
}
Updated code:
#include <iostream>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>
#include <algorithm>
// g++ -std=c++0x -pg perfLookup.cpp -O2 -o perfLookup && ./perfLookup && gprof perfLookup |less

std::vector<uint16_t> lookup(std::vector<uint16_t> src, int repeat) {
    uint16_t lookup[512];
    for (int i = 0; i < 256; i++) {
        lookup[i] = i;
    }
    for (int i = 256; i < 512; i++) {
        lookup[i] = 255;
    }
    std::vector<uint16_t> dst(src.size());
    for (int i = 0; i < repeat; i++) {
        for (size_t k = 0; k < src.size(); k++) {
            dst[k] = lookup[src[k]];
        }
    }
    return dst;
}

std::vector<uint16_t> ifelse(std::vector<uint16_t> src, int repeat) {
    std::vector<uint16_t> dst(src.size());
    for (int i = 0; i < repeat; i++) {
        for (size_t k = 0; k < src.size(); k++) {
            dst[k] = (src[k] > 255) ? 255 : src[k];
        }
    }
    return dst;
}

std::vector<uint16_t> copyv(std::vector<uint16_t> src, int repeat) {
    std::vector<uint16_t> dst(src.size());
    for (int i = 0; i < repeat; i++) {
        dst = src;
        for (size_t k = 0; k < src.size(); k++) {
            if (dst[k] > 255) {
                dst[k] = 255;
            }
        }
    }
    return dst;
}

std::vector<uint16_t> copyC(std::vector<uint16_t> src, int repeat)
{
    uint16_t* dst = (uint16_t*) malloc(sizeof(uint16_t) * src.size()); // alloc array for dst
    for (int i = 0; i < repeat; i++) {
        std::memcpy(dst, &src[0], sizeof(uint16_t) * src.size()); // copy src into array
        for (size_t k = 0; k < src.size(); k++) {
            if ((dst[k] & 0xFF00) != 0)
                dst[k] = 0x00FF;
        }
    }
    std::vector<uint16_t> result(dst, dst + src.size());
    free(dst);
    return result;
}

int main()
{
    int n = 10000;
    std::vector<uint16_t> src(n);
    for (size_t i = 0; i < src.size(); i++) {
        src[i] = rand() % 510;
    }
    std::vector<uint16_t> dst;
    dst = lookup(src, 10000);
    dst = ifelse(src, 10000);
    dst = copyv(src, 10000);
    dst = copyC(src, 10000);
}
Well, since src is declared as std::vector<uint8_t>, src[i] is never larger than 255, which is the highest possible value for an 8-bit unsigned integer. Therefore, my guess is that the compiler optimizes the check away. What remains is just the boilerplate loop, so the benchmark is meaningless.
Provided the check weren't meaningless (e.g. checking against 64 rather than 255), the result of the 'optimization' will presumably be highly machine-dependent. Branch prediction may (depending on the input data) do a good job of reducing the cost of the branch. The lookup table, on the other hand, needs (again depending on the input data) random memory accesses and spoils the cache ...
Apart from what Alexander has already said:
Lookup tables can improve performance drastically. However, this is offset by the time it takes to create the lookup table in the first place. Usually you would benchmark this separately.
Another thing to keep in mind is that the lookup table occupies space in the cache and may therefore cause cache misses if it's big. If there are enough cache misses, the if method will be faster than the lookup table.
Finally, gprof is very good at identifying bottlenecks, but I wouldn't use it for benchmarks; use a timing function instead. gprof uses sampling, which may strictly speaking be mapped to time consumed, but is less precise here.