C: Accessing lookup tables faster?

I have a piece of code that traces 4 sines at a time.

My original code was making roughly 12000 sin() function calls per frame and was running at 30 fps.

I tried optimizing it by generating lookup tables. I ended up with 16 different lookup tables, which I declare and load in a separate header file included at the top of my program. Each table is declared like so:

static const float d4_lookup[800] = {...};

Now, with this new method I actually lost fps?! I'm running at 20 fps instead of 30. Each frame now only makes 8 sin/cos calls plus 19200 table lookups, versus 12000 sin() calls before. I compile using gcc with the -O3 flag. At the moment, the lookup tables are included at the top and are part of the global scope of the program.

I assume I'm not loading them into the right kind of memory, or something to that effect. How can I speed up the lookup time?
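For reference, here is roughly how much data the loop touches each frame (a back-of-the-envelope sketch assuming 4-byte floats; the 16-table and 800-entry counts are the ones described above):

#include <stdio.h>

int main(void)
{
    /* 16 lookup tables and 4 output arrays of 800 floats each
       (counts taken from the description above). */
    size_t tables  = 16u * 800u * sizeof(float);   /* 51,200 bytes */
    size_t outputs =  4u * 800u * sizeof(float);   /* 12,800 bytes */
    printf("per-frame working set: %zu bytes\n", tables + outputs);
    return 0;
}

On most desktop CPUs that is roughly twice the size of a typical 32 KB L1 data cache.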

** EDIT 1 **

As requested, here's the function that uses the lookup calls, it is called once per frame:

void
update_sines(void)
{
    static float c1_sin, c1_cos;
    static float c2_sin, c2_cos;
    static float c3_sin, c3_cos;
    static float c4_sin, c4_cos;

    /* Build the elapsed-time value; spec, s, ms, etime and concatenate()
       are defined elsewhere in the program. */
    clock_gettime(CLOCK_MONOTONIC, &spec);
    s = spec.tv_sec;
    ms = spec.tv_nsec * 0.0000001;
    etime = concatenate((long)s, ms);

    /* Only 8 trig calls per frame: the time-dependent factors. */
    c1_sin = sinf(etime * 0.00525);
    c1_cos = cosf(etime * 0.00525);
    c2_sin = sinf(etime * 0.007326);
    c2_cos = cosf(etime * 0.007326);
    c3_sin = sinf(etime * 0.0046);
    c3_cos = cosf(etime * 0.0046);
    c4_sin = sinf(etime * 0.007992);
    c4_cos = cosf(etime * 0.007992);

    int k;

    /* Per-sample update via the angle-addition identity:
       sin(b*x + c*t) = sin(b*x)*cos(c*t) + cos(b*x)*sin(c*t),
       with sin(b*x) and cos(b*x) precomputed in the lookup tables. */
    for (k = 0; k < 800; ++k)
    {       
        sine1[k] = a1_lookup[k] * ((bx1_sin_lookup[k] * c1_cos) + (c1_sin * bx1_cos_lookup[k])) + d1_lookup[k];
        sine2[k] = a2_lookup[k] * ((bx2_sin_lookup[k] * c2_cos) + (c2_sin * bx2_cos_lookup[k])) + d2_lookup[k] + 50;
        sine3[k] = a3_lookup[k] * ((bx3_sin_lookup[k] * c3_cos) + (c3_sin * bx3_cos_lookup[k])) + d3_lookup[k];
        sine4[k] = a4_lookup[k] * ((bx4_sin_lookup[k] * c4_cos) + (c4_sin * bx4_cos_lookup[k])) + d4_lookup[k] + 50;
    }

}
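For context, the tables hold the time-independent part of each sine, so only the eight trig calls above change per frame. Here is a rough sketch of the kind of code that produces one pair of tables; the coefficient b1 and the per-sample phase are placeholders, since my real values are baked into const arrays in the header:

#include <math.h>

#define N_SAMPLES 800

static float bx1_sin_lookup[N_SAMPLES];
static float bx1_cos_lookup[N_SAMPLES];

/* Hypothetical one-time setup; b1 is a placeholder coefficient. */
static void init_bx1_tables(float b1)
{
    int k;
    for (k = 0; k < N_SAMPLES; ++k)
    {
        float bx = b1 * (float)k;       /* fixed phase for sample k */
        bx1_sin_lookup[k] = sinf(bx);   /* reused every frame       */
        bx1_cos_lookup[k] = cosf(bx);
    }
}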

** UPDATE **

For anyone reading this thread, I gave up on this problem. I tried OpenCL kernels, structs, and SIMD instructions, as well as all the solutions shown here. In the end the original code, which computed sinf() 12800 times per frame, ran faster than the lookup tables, since the lookup tables didn't fit into the cache. Yet it was still only doing 30 fps; there was just too much going on to keep up with my 60 fps target. I've decided to take a different direction. Thanks to everyone who contributed to this thread. Most of these solutions would probably give some half-decent speed improvements, but nothing like the 200% speed-up I needed here to make the lookup tables work the way I wanted.

ReX357 asked Jan 03 '14

3 Answers

Sometimes it's hard to know exactly what's slowing you down, but you are probably ruining your cache hits. You could try a lookup table of structs:

typedef struct
{
  float a1, a2, a3, a4;
  float bx1_sin, bx2_sin, bx3_sin, bx4_sin;
  float bx1_cos, bx2_cos, bx3_cos, bx4_cos;
  float d1, d2, d3, d4;
  float sine1, sine2, sine3, sine4;   /* the outputs live here as well */
} lookup_table;

then

lookup_table lookup[800];

Now everything for the kth element sits in the same small chunk of memory.
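A hedged sketch of what the per-frame loop could then look like (the field names mirror the question's arrays; this illustrates the idea rather than quoting exact code):

int k;
for (k = 0; k < 800; ++k)
{
    lookup_table *p = &lookup[k];   /* one contiguous record per sample */
    p->sine1 = p->a1 * ((p->bx1_sin * c1_cos) + (c1_sin * p->bx1_cos)) + p->d1;
    p->sine2 = p->a2 * ((p->bx2_sin * c2_cos) + (c2_sin * p->bx2_cos)) + p->d2 + 50;
    p->sine3 = p->a3 * ((p->bx3_sin * c3_cos) + (c3_sin * p->bx3_cos)) + p->d3;
    p->sine4 = p->a4 * ((p->bx4_sin * c4_cos) + (c4_sin * p->bx4_cos)) + p->d4 + 50;
}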

Also, if you use a macro that takes k as a parameter to do the contents of the loop, let's say SINE_CALC(k), or an inline function...

you can do

for (k = 0; k < 800; ++k)
{
  SINE_CALC(k); k++;
  SINE_CALC(k); k++;
  SINE_CALC(k); k++;
  SINE_CALC(k); k++;
  SINE_CALC(k);
}

If you use a macro, make sure the k++ is outside the macro call, as shown; the final SINE_CALC in each group has no trailing k++ because the loop's own ++k supplies that increment.
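A minimal sketch of such a macro, reusing the arrays from the question (the name SINE_CALC comes from this answer; the body below is an assumption, not code from the original post):

#define SINE_CALC(k)                                                          \
    do {                                                                      \
        sine1[(k)] = a1_lookup[(k)] * ((bx1_sin_lookup[(k)] * c1_cos)         \
                     + (c1_sin * bx1_cos_lookup[(k)])) + d1_lookup[(k)];      \
        sine2[(k)] = a2_lookup[(k)] * ((bx2_sin_lookup[(k)] * c2_cos)         \
                     + (c2_sin * bx2_cos_lookup[(k)])) + d2_lookup[(k)] + 50; \
        sine3[(k)] = a3_lookup[(k)] * ((bx3_sin_lookup[(k)] * c3_cos)         \
                     + (c3_sin * bx3_cos_lookup[(k)])) + d3_lookup[(k)];      \
        sine4[(k)] = a4_lookup[(k)] * ((bx4_sin_lookup[(k)] * c4_cos)         \
                     + (c4_sin * bx4_cos_lookup[(k)])) + d4_lookup[(k)] + 50; \
    } while (0)

The do { ... } while (0) wrapper keeps each SINE_CALC(k); a single statement, so the k++ that follows it behaves as expected.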

answered by Keith Nicholas


Try splitting your loop like this:

for (k = 0; k < 800; ++k)
{       
    sine1[k] = a1_lookup[k];
    sine2[k] = a2_lookup[k];
    sine3[k] = a3_lookup[k];
    sine4[k] = a4_lookup[k];
}
for (k = 0; k < 800; ++k)
{       
    sine1[k] *= ((bx1_sin_lookup[k] * c1_cos) + (c1_sin * bx1_cos_lookup[k]));
    sine2[k] *= ((bx2_sin_lookup[k] * c2_cos) + (c2_sin * bx2_cos_lookup[k]));
    sine3[k] *= ((bx3_sin_lookup[k] * c3_cos) + (c3_sin * bx3_cos_lookup[k]));
    sine4[k] *= ((bx4_sin_lookup[k] * c4_cos) + (c4_sin * bx4_cos_lookup[k]));
}
for (k = 0; k < 800; ++k)
{       
    sine1[k] += d1_lookup[k];
    sine2[k] += d2_lookup[k] + 50;
    sine3[k] += d3_lookup[k];
    sine4[k] += d4_lookup[k] + 50;
}

By accessing fewer lookup tables in each loop, you should be able to stay in the cache. The middle loop could be split up as well, but you'll need to create an intermediate table for one of the sub-expressions.
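For example, a hedged sketch of that further split for the first sine, using a scratch array (the name rot1 and the extra pass are illustrative assumptions):

static float rot1[800];   /* intermediate table for one sub-expression */

for (k = 0; k < 800; ++k)
    rot1[k] = (bx1_sin_lookup[k] * c1_cos) + (c1_sin * bx1_cos_lookup[k]);
for (k = 0; k < 800; ++k)
    sine1[k] *= rot1[k];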

answered by Barmar


Intel processors can predict serial access (and prefetch accordingly) for up to 4 arrays, for both forward and backward traversal; at least this was true in the Core 2 Duo days. Split your loop into:

for (k = 0; k < 800; ++k)
    sine1[k] = a1_lookup[k] * ((bx1_sin_lookup[k] * c1_cos) + (c1_sin * bx1_cos_lookup[k])) + d1_lookup[k];
for (k = 0; k < 800; ++k)
    sine2[k] = a2_lookup[k] * ((bx2_sin_lookup[k] * c2_cos) + (c2_sin * bx2_cos_lookup[k])) + d2_lookup[k] + 50;
for (k = 0; k < 800; ++k)
    sine3[k] = a3_lookup[k] * ((bx3_sin_lookup[k] * c3_cos) + (c3_sin * bx3_cos_lookup[k])) + d3_lookup[k];
for (k = 0; k < 800; ++k)
    sine4[k] = a4_lookup[k] * ((bx4_sin_lookup[k] * c4_cos) + (c4_sin * bx4_cos_lookup[k])) + d4_lookup[k] + 50;

I guess you have more cache load here than in the benchmarks behind the other answers, so this does matter. I also recommend not unrolling loops by hand; compilers do that well.

answered by Aleksandr Pakhomov