I'm writing a digital filter, and I need to keep the last X values and sum them all together.
Now there are two possible approaches to this. Either I shift the whole array using memmove to make room for the next value, and have the right indexes to the array as hard-coded values in my summing algorithm.
memmove(&Fifo[0], &Fifo[1], 12 * 4); // Shift array to the left
Result += Factor[1] * (Fifo[5] + Fifo[7]);
Result += Factor[2] * (Fifo[4] + Fifo[8]);
Result += Factor[3] * (Fifo[3] + Fifo[9]);
Result += Factor[4] * (Fifo[2] + Fifo[10]);
Result += Factor[5] * (Fifo[1] + Fifo[11]);
Result += Factor[6] * (Fifo[0] + Fifo[12]);
Or alternatively, I don't copy any memory, but increment a counter instead, and calculate each index from that using a modulo operation (like a circular buffer).
i++; // Increment the index
Result += Factor[1] * (Fifo[(i + 5) % 13] + Fifo[(i + 7) % 13]);
Result += Factor[2] * (Fifo[(i + 4) % 13] + Fifo[(i + 8) % 13]);
Result += Factor[3] * (Fifo[(i + 3) % 13] + Fifo[(i + 9) % 13]);
Result += Factor[4] * (Fifo[(i + 2) % 13] + Fifo[(i + 10) % 13]);
Result += Factor[5] * (Fifo[(i + 1) % 13] + Fifo[(i + 11) % 13]);
Result += Factor[6] * (Fifo[(i + 0) % 13] + Fifo[(i + 12) % 13]);
Since its an embedded ARM cpu, I was wondering what would be more efficient. Since I assume that the CPU has to move at least one 32-bit value internally to do the modulo operation, could it be that just moving the whole array is just as fast as calculating the right indexes?
If you need to know which is faster, you need to do benchmark. If you want to know why, you need to examine the assembly.
That being said, there is also halfway solution which could be good enough:
Use buffer larger than needed and only do memmove when your buffer is full.
That way you only have to keep track of starting offset, and not have to worry
about the problems that come with circular buffers. You have to use more memory though.
So if you wish to have 5 elements and use buffer for 10 elements, you only have
to do memmove every 5 insertions. (Except the first pass when you can do 10 insertions)
I've done exactly that on a Cortex M0 (LPC11C14) for a FIR filter of size 15 (Savitzky-Golay for measuring line voltage).
I found that in my case copying was somewhat slower than using a circular buffer of size 16 and computing the indices using the modulo operator. Note that 16 is a power of two, which makes division very cheap.
I tried several variants and used a port pin for measuring execution time, I recommend you do the same.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With