C: Accessing lookup tables faster?

I have a piece of code that traces 4 sines at a time.

My original code was making roughly 12000 sin() function calls per frame and was running at 30 fps.

I tried optimizing it by generating lookup tables. I ended up with 16 different lookup tables, which I declare and load in a separate header file included at the top of my program. Each table is declared like so:

static const float d4_lookup[800] = {...};

Now, with this new method I actually lost fps?! I'm running at 20 fps instead of 30. Each frame now only makes 8 sin/cos calls plus 19200 table lookups, versus 12000 sin() calls before. I compile using gcc with the -O3 flag. At the moment, the lookup tables are included at the top and are part of the global scope of the program.

I assume I'm not loading them into the right kind of memory, or something to that effect. How can I speed up the lookup time?
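For reference, here is roughly how much data the loop touches each frame (a back-of-the-envelope sketch assuming 4-byte floats; the 16-table and 800-entry counts are the ones described above):

#include <stdio.h>

int main(void)
{
    /* 16 lookup tables and 4 output arrays of 800 floats each
       (counts taken from the description above). */
    size_t tables  = 16u * 800u * sizeof(float);   /* 51,200 bytes */
    size_t outputs =  4u * 800u * sizeof(float);   /* 12,800 bytes */
    printf("per-frame working set: %zu bytes\n", tables + outputs);
    return 0;
}

On most desktop CPUs that is roughly twice the size of a typical 32 KB L1 data cache.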

** EDIT 1 **

As requested, here's the function that uses the lookup calls, it is called once per frame:

void
update_sines(void)
{
    static float c1_sin, c1_cos;
    static float c2_sin, c2_cos;
    static float c3_sin, c3_cos;
    static float c4_sin, c4_cos;

    /* Build the elapsed-time value; spec, s, ms, etime and concatenate()
       are defined elsewhere in the program. */
    clock_gettime(CLOCK_MONOTONIC, &spec);
    s = spec.tv_sec;
    ms = spec.tv_nsec * 0.0000001;
    etime = concatenate((long)s, ms);

    /* Only 8 trig calls per frame: the time-dependent factors. */
    c1_sin = sinf(etime * 0.00525);
    c1_cos = cosf(etime * 0.00525);
    c2_sin = sinf(etime * 0.007326);
    c2_cos = cosf(etime * 0.007326);
    c3_sin = sinf(etime * 0.0046);
    c3_cos = cosf(etime * 0.0046);
    c4_sin = sinf(etime * 0.007992);
    c4_cos = cosf(etime * 0.007992);

    int k;

    /* Per-sample update via the angle-addition identity:
       sin(b*x + c*t) = sin(b*x)*cos(c*t) + cos(b*x)*sin(c*t),
       with sin(b*x) and cos(b*x) precomputed in the lookup tables. */
    for (k = 0; k < 800; ++k)
    {       
        sine1[k] = a1_lookup[k] * ((bx1_sin_lookup[k] * c1_cos) + (c1_sin * bx1_cos_lookup[k])) + d1_lookup[k];
        sine2[k] = a2_lookup[k] * ((bx2_sin_lookup[k] * c2_cos) + (c2_sin * bx2_cos_lookup[k])) + d2_lookup[k] + 50;
        sine3[k] = a3_lookup[k] * ((bx3_sin_lookup[k] * c3_cos) + (c3_sin * bx3_cos_lookup[k])) + d3_lookup[k];
        sine4[k] = a4_lookup[k] * ((bx4_sin_lookup[k] * c4_cos) + (c4_sin * bx4_cos_lookup[k])) + d4_lookup[k] + 50;
    }

}
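For context, the tables hold the time-independent part of each sine, so only the eight trig calls above change per frame. Here is a rough sketch of the kind of code that produces one pair of tables; the coefficient b1 and the per-sample phase are placeholders, since my real values are baked into const arrays in the header:

#include <math.h>

#define N_SAMPLES 800

static float bx1_sin_lookup[N_SAMPLES];
static float bx1_cos_lookup[N_SAMPLES];

/* Hypothetical one-time setup; b1 is a placeholder coefficient. */
static void init_bx1_tables(float b1)
{
    int k;
    for (k = 0; k < N_SAMPLES; ++k)
    {
        float bx = b1 * (float)k;       /* fixed phase for sample k */
        bx1_sin_lookup[k] = sinf(bx);   /* reused every frame       */
        bx1_cos_lookup[k] = cosf(bx);
    }
}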

** UPDATE **

For anyone reading this thread, I gave up on this problem. I tried OpenCL kernels, structs, and SIMD instructions, as well as all the solutions shown here. In the end the original code, which computed sinf() 12800 times per frame, ran faster than the lookup tables, since the lookup tables didn't fit into the cache. Yet it was still only doing 30 fps; there was just too much going on to keep up with my 60 fps target. I've decided to take a different direction. Thanks to everyone who contributed to this thread. Most of these solutions would probably give some half-decent speed improvements, but nothing like the 200% speed-up I needed here to make the lookup tables work the way I wanted.

ReX357 asked Jan 03 '14

3 Answers

Sometimes it's hard to know exactly what's slowing you down, but you are probably ruining your cache hits. You could try a lookup table of structs:

typedef struct
{
  float a1, a2, a3, a4;
  float bx1_sin, bx2_sin, bx3_sin, bx4_sin;
  float bx1_cos, bx2_cos, bx3_cos, bx4_cos;
  float d1, d2, d3, d4;
  float sine1, sine2, sine3, sine4;   /* the outputs live here as well */
} lookup_table;

then

lookup_table lookup[800];

Now everything for the kth element sits in the same small chunk of memory.
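A hedged sketch of what the per-frame loop could then look like (the field names mirror the question's arrays; this illustrates the idea rather than quoting exact code):

int k;
for (k = 0; k < 800; ++k)
{
    lookup_table *p = &lookup[k];   /* one contiguous record per sample */
    p->sine1 = p->a1 * ((p->bx1_sin * c1_cos) + (c1_sin * p->bx1_cos)) + p->d1;
    p->sine2 = p->a2 * ((p->bx2_sin * c2_cos) + (c2_sin * p->bx2_cos)) + p->d2 + 50;
    p->sine3 = p->a3 * ((p->bx3_sin * c3_cos) + (c3_sin * p->bx3_cos)) + p->d3;
    p->sine4 = p->a4 * ((p->bx4_sin * c4_cos) + (c4_sin * p->bx4_cos)) + p->d4 + 50;
}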

Also, if you use a macro that takes k as a parameter to do the contents of the loop, let's say SINE_CALC(k), or an inline function...

you can do

for (k = 0; k < 800; ++k)
{
  SINE_CALC(k); k++;
  SINE_CALC(k); k++;
  SINE_CALC(k); k++;
  SINE_CALC(k); k++;
  SINE_CALC(k);
}

If you use a macro, make sure the k++ is outside the macro call, as shown; the final SINE_CALC in each group has no trailing k++ because the loop's own ++k supplies that increment.
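A minimal sketch of such a macro, reusing the arrays from the question (the name SINE_CALC comes from this answer; the body below is an assumption, not code from the original post):

#define SINE_CALC(k)                                                          \
    do {                                                                      \
        sine1[(k)] = a1_lookup[(k)] * ((bx1_sin_lookup[(k)] * c1_cos)         \
                     + (c1_sin * bx1_cos_lookup[(k)])) + d1_lookup[(k)];      \
        sine2[(k)] = a2_lookup[(k)] * ((bx2_sin_lookup[(k)] * c2_cos)         \
                     + (c2_sin * bx2_cos_lookup[(k)])) + d2_lookup[(k)] + 50; \
        sine3[(k)] = a3_lookup[(k)] * ((bx3_sin_lookup[(k)] * c3_cos)         \
                     + (c3_sin * bx3_cos_lookup[(k)])) + d3_lookup[(k)];      \
        sine4[(k)] = a4_lookup[(k)] * ((bx4_sin_lookup[(k)] * c4_cos)         \
                     + (c4_sin * bx4_cos_lookup[(k)])) + d4_lookup[(k)] + 50; \
    } while (0)

The do { ... } while (0) wrapper keeps each SINE_CALC(k); a single statement, so the k++ that follows it behaves as expected.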

answered by Keith Nicholas


Try splitting your loop like this:

for (k = 0; k < 800; ++k)
{       
    sine1[k] = a1_lookup[k];
    sine2[k] = a2_lookup[k];
    sine3[k] = a3_lookup[k];
    sine4[k] = a4_lookup[k];
}
for (k = 0; k < 800; ++k)
{       
    sine1[k] *= ((bx1_sin_lookup[k] * c1_cos) + (c1_sin * bx1_cos_lookup[k]));
    sine2[k] *= ((bx2_sin_lookup[k] * c2_cos) + (c2_sin * bx2_cos_lookup[k]));
    sine3[k] *= ((bx3_sin_lookup[k] * c3_cos) + (c3_sin * bx3_cos_lookup[k]));
    sine4[k] *= ((bx4_sin_lookup[k] * c4_cos) + (c4_sin * bx4_cos_lookup[k]));
}
for (k = 0; k < 800; ++k)
{       
    sine1[k] += d1_lookup[k];
    sine2[k] += d2_lookup[k] + 50;
    sine3[k] += d3_lookup[k];
    sine4[k] += d4_lookup[k] + 50;
}

By accessing fewer lookup tables in each loop, you should be able to stay in the cache. The middle loop could be split up as well, but you'll need to create an intermediate table for one of the sub-expressions.
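For example, a hedged sketch of that further split for the first sine, using a scratch array (the name rot1 and the extra pass are illustrative assumptions):

static float rot1[800];   /* intermediate table for one sub-expression */

for (k = 0; k < 800; ++k)
    rot1[k] = (bx1_sin_lookup[k] * c1_cos) + (c1_sin * bx1_cos_lookup[k]);
for (k = 0; k < 800; ++k)
    sine1[k] *= rot1[k];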

answered by Barmar


Intel processors can predict serial access (and prefetch accordingly) for up to 4 arrays, for both forward and backward traversal; at least this was true in the Core 2 Duo days. Split your loop into:

for (k = 0; k < 800; ++k)
    sine1[k] = a1_lookup[k] * ((bx1_sin_lookup[k] * c1_cos) + (c1_sin * bx1_cos_lookup[k])) + d1_lookup[k];
for (k = 0; k < 800; ++k)
    sine2[k] = a2_lookup[k] * ((bx2_sin_lookup[k] * c2_cos) + (c2_sin * bx2_cos_lookup[k])) + d2_lookup[k] + 50;
for (k = 0; k < 800; ++k)
    sine3[k] = a3_lookup[k] * ((bx3_sin_lookup[k] * c3_cos) + (c3_sin * bx3_cos_lookup[k])) + d3_lookup[k];
for (k = 0; k < 800; ++k)
    sine4[k] = a4_lookup[k] * ((bx4_sin_lookup[k] * c4_cos) + (c4_sin * bx4_cos_lookup[k])) + d4_lookup[k] + 50;

I guess you have more cache load here than in the benchmarks behind the other answers, so this does matter. I also recommend not unrolling loops by hand; compilers do that well.

answered by Aleksandr Pakhomov