We are trying to implement a fixed-point nonlinear mathematical function on an FPGA. We want very low latency (2-4 clock cycles max), a pipeline that accepts a new input and produces a new answer every clock cycle (no dropped inputs, since they arrive every clock cycle), decent accuracy, AND reasonable FPGA resource utilization.
We first implemented the computation using a combination of CORDIC computers and DSP blocks, which gave a pretty good solution except that the CORDIC computers required about 12 clock cycles for good accuracy.
Using a LUT without interpolation would require far too many block RAMs, since a 32-bit input means 2^32 entries, so we threw that idea out.
Our next option was a lookup table with interpolation. The latency was good because we could index the LUT directly with the upper bits of the input value. The problem was that the accuracy wasn't very good in the nonlinear sections.
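In software terms, what we tried looks roughly like this (a Python model purely for illustration; the 10-bit table and the sqrt() stand-in are not our real parameters):

    import math

    # Illustrative model (not our real parameters): a uniform LUT indexed by
    # the top TABLE_BITS of the input, with linear interpolation on the rest.
    INPUT_BITS = 32
    TABLE_BITS = 10
    LOW_BITS = INPUT_BITS - TABLE_BITS

    f = lambda v: math.sqrt(v)          # stand-in for our real nonlinear function
    table = [f(i << LOW_BITS) for i in range((1 << TABLE_BITS) + 1)]

    def lut_interp(x):
        # The index is free to compute: it is just a slice of the upper bits.
        idx = x >> LOW_BITS
        frac = (x & ((1 << LOW_BITS) - 1)) / (1 << LOW_BITS)
        return table[idx] + frac * (table[idx + 1] - table[idx])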
We are now trying to use a LUT with non-uniform spacing between the samples: basically, we sample the function more densely in the nonlinear portions and less densely where it looks more linear. This should help our accuracy problem a lot, but we now face the problem that we can't index the LUT directly with the upper bits of the input value. We investigated doing a binary search to find the index, but the latency suffered. Resource utilization wasn't great either, because in order to keep producing an answer on the output every clock cycle, we had to replicate our LUTs across the pipeline stages just to handle the binary search. We tried a few tricks like using dual-port RAMs, but the latency is still a killer.
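For comparison, here is an illustrative Python sketch of the non-uniform version we are fighting with (the breakpoints are invented). The bisect call is the binary search whose log2(N) compares become log2(N) serial lookups/pipeline stages in hardware:

    import bisect, math

    f = lambda v: math.sqrt(v)          # same stand-in function as above
    # Invented non-uniform sample points: dense where f() bends, sparse elsewhere.
    breakpoints = [0, 4, 16, 64, 256, 1024, 65536, 2**24, 2**32]
    values = [f(b) for b in breakpoints]

    def nonuniform_lookup(x):
        # bisect is a binary search: ~log2(len(breakpoints)) compares, which in
        # hardware turns into that many serial comparisons / pipeline stages.
        i = min(bisect.bisect_right(breakpoints, x) - 1, len(breakpoints) - 2)
        x0, x1 = breakpoints[i], breakpoints[i + 1]
        frac = (x - x0) / (x1 - x0)
        return values[i] + frac * (values[i + 1] - values[i])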
So we are wondering if anyone has had a similar problem and knows of a good indexing solution, or if there are special/smart ways to sample our function non-uniformly and build the LUT in such a way that indexing can still be computed fairly quickly.
Let's say your function computes
y = f(x)
you could first compress x with a piecewise-linear function
z = g(x)
implemented as a small LUT plus linear interpolation (taking into account only the few most significant bits of x), such that you dedicate more memory to the interesting regions of your function and less to the almost-constant regions.
Then you calculate y as
y = f(x)
= h(z)
= h(g(x))
where h would be a pre-processed, "warped" version of your original table:
h(z) = f(g'(z))
where g' is the inverse of g, i.e.
g'(g(x)) = x
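As a concrete (idealized) example: take f(x) = sqrt(x) on [0, 1] and pick g(x) = x^(1/4). Then g'(z) = z^4, so
h(z) = f(g'(z)) = sqrt(z^4) = z^2
and indeed
h(g(x)) = (x^(1/4))^2 = sqrt(x) = f(x)
The win is that h(z) = z^2 has gentle, bounded curvature everywhere, so a uniform LUT for h is far more accurate than a uniform LUT of the same size for sqrt(x), whose curvature blows up near zero.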
Then you would still use only LUTs indexed by the first few bits plus linear interpolation, but in two stages:
+--------+ +--------+
| LUT #1 | | LUT #2 |
| | | |
-- x -->| g(x) |-- z -->| h(z) |--> y
| | | |
| | | |
+--------+ +--------+
A rough sketch of what this could look like:
y = f(x)
^
| : : ..:..............:
| : : ........ : :
| : : .. : :
| : :.. : :
| : ...: : :
| :........... : : :
|..............: : : :
+-----------------------------------------------------------+->
| | | | |
z = g(x)
^
| : : ...O..............O
| : : .... : :
| : : .... : :
| : ...O... : :
| : .... : : :
| : .... : : :
O..............O... : : :
+-----------------------------------------------------------+->
| | | | |
y = h(z)
^
| : : ..:...:
| : : ......... : :
| : : ....... : :
| : :...... : :
| : ........: : :
| :.................. : : :
|...: : : :
+-----------------------------------------------------------+->
| | | | |
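To make this concrete, here is a rough Python model of the whole idea (floats instead of fixed point, sqrt() as a placeholder for your real f, and made-up table sizes), just to show that two cheap uniform lookups compose into a non-uniformly sampled approximation:

    import numpy as np

    # Toy end-to-end model of the two-stage scheme. Floats stand in for fixed
    # point, sqrt() stands in for the real f(), and the table sizes are made up.
    f = np.sqrt

    N = 64                                    # segments per stage (6 address bits)
    xg = np.linspace(0.0, 1.0, N + 1)         # uniform x knots for LUT #1 (g)
    zh = np.linspace(0.0, 1.0, N + 1)         # uniform z knots for LUT #2 (h)

    # Example warp: g is steep near x = 0, where sqrt() bends hardest, so more
    # of the z range (and therefore more h entries) is spent there.
    g_table = xg ** 0.25

    # h(z) = f(g'(z)): invert the piecewise-linear g numerically and sample f
    # through it at the uniform z knots.
    g_inv = np.interp(zh, g_table, xg)
    h_table = f(g_inv)

    def two_stage(x):
        # LUT #1 then LUT #2, each a uniform table + linear interpolation,
        # i.e. each indexable by a plain bit-slice of its input.
        z = np.interp(x, xg, g_table)
        return np.interp(z, zh, h_table)

    x = np.linspace(0.0, 1.0, 200001)
    print("two-stage max error        :", np.max(np.abs(two_stage(x) - f(x))))
    print("single uniform LUT max err :", np.max(np.abs(np.interp(x, xg, f(xg)) - f(x))))

Each stage is just one table read plus one multiply-add, so in hardware it should map to roughly one BRAM access and one DSP slice per stage, with a handful of pipeline registers in total.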
The interesting problem is then of course how to find the optimal g(x) such that the worst-case (or average-case) error is minimized.
But that's something you can do offline, or even analytically.
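For example, one common heuristic (sketched below in Python, with sqrt() again standing in for your real f; not claimed to be optimal for your function): for piecewise-linear interpolation the per-segment error grows roughly with segment_width^2 * |f''|, so a knot density proportional to sqrt(|f''|) tends to equalize the error across segments, and g is then the normalized cumulative integral of that density.

    import numpy as np

    f = np.sqrt                                   # stand-in for the real function

    xs = np.linspace(1e-6, 1.0, 100001)           # fine grid (avoids the singularity at 0)
    fpp = np.gradient(np.gradient(f(xs), xs), xs) # numerical f''
    dens = np.sqrt(np.abs(fpp)) + 1e-9            # knot density ~ sqrt(|f''|); epsilon keeps g monotonic
    cum = np.cumsum(dens)
    g_of = (cum - cum[0]) / (cum[-1] - cum[0])    # monotonic map of [0, 1] onto [0, 1]

    N = 64
    xg = np.linspace(0.0, 1.0, N + 1)             # uniform x knots for LUT #1
    g_table = np.interp(xg, xs, g_of)             # the g(x) table to load into LUT #1

    # The effective (non-uniform) sample points in x are the preimages of the
    # uniform z grid -- dense where |f''| is large:
    effective_x = np.interp(np.linspace(0.0, 1.0, N + 1), g_of, xs)
    print(effective_x[:8])

For f(x) = sqrt(x) this comes out to roughly g(x) = x^(1/4), which is the warp used in the model above.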