I have a circuit where at each clock cycle, N 32-bit inputs are present to be computed on. I have a binary operation that takes two 32-bit inputs and yields one 32-bit output. This operation is associative and I would like to apply it to the entire N 32-bit inputs in order to yield one 32-bit output. Currently, I am achieving this by implementing a pipelined binary tree of operations.
For a concrete example, supposing N = 4 and I have inputs {a,b,c,d} then I would do the following:
a op b => reg1
c op d => reg2
reg1 op reg2 => result
When a stage in the tree is not divisible by 2, I insert a dummy operation that takes only 1 input an yields 1 output with the same latency.
The issue that I have is that I am concerned with a few sizes of N inputs {9, 25, 49, 81, 121}. The largest size of N, 121, requires 110% of luts in my FPGA fabric while all other sizes fit easily. The tree of these binary operations is by far the largest consumer of luts in my design.
I am aware that I could reduce the utilization of luts by nearly half by halving the number of op circuits residing on my board and multiplexing their inputs. Unfortunately, this would mean only getting a result every other clock cycle and halving the bandwidth.
Because a full tree requires only ~10% more resources than the board offers, a 50% reduction in bandwidth seems like too significant of a hit. Is there an architecture where a I can make the trade off of a more fine grained reduction in size for a fine grained reduction in bandwidth?
Have you considered using 3, 4, 5, or 6 input operators? LUTs in current Altera and Xilinx FPGAs are 6 input. If your operators are really simple ( ie. AND, OR, XOR) you may be wasting a lot of LUTs in your current implementation by only using a 2 input function instead of a 6 input function ( assuming that you register each stage of the pipeline). This would turn your binary tree into a k-ary tree of operations. With your numbers would waste many LUTs depending on the choice of N.
Here are the choices of K that would make sense:
N | K | LUTs/bit | LUTs | LUTs/bit @k=2 | LUTs @k=2
-----|---|----------|------|---------------|-----------
9 | 3 | 4 | 128 | 11 | 352
25 | 5 | 6 | 192 | 25 | 800
49 | 5 | 13 | 416 | 50 | 1600
81 | 6 | 18 | 576 | 75 | 2400
121 | 6 | 26 | 832 | 123 | 3936
Ok, I took your example with 9 inputs and tried to solve your binary tree problem with 75% of the binary operators.
I name the inputs a,b,c,d,e,f,g,h,i
. To distinguish the input values in time, I'll add a tick after a.
a' = a at time 1
a''' = a at time 3
(ab)' = result of a' and b'
9 inputs require 8 binary operations. I restrict my processing scheme to 6 operators, so it uses 75% of the needed operators. Each line in the following schema represents one operator.
time 1 time 2 time 3 time 4 | time 1+4
a'b' (abcd)'(efgh)' (ab)''(cd)'' e'''f''' | a'b'
c'd' (abcdefgh)'i' (ef)''(gh)'' g'''h''' | c'd'
e'f' a''b'' (abcd)''(efgh)'' (ab)'''(cd)''' | e'f'
g'h' c''d'' (abcdefgh)''i'' (ef)'''(gh)''' | g'h'
(ab)'(cd)' e''f'' a'''b''' (abcd)'''(efgh)''' | (ab)'(cd)'
(ef)'(gh)' g''h'' c'''d''' (abcdefgh)'''i''' | (ef)'(gh)'
After 4 cycles the schema repeats. So the processing of 3 input sets requires 4 cycles:
=> 33 % overhead.
This schema can be sorted, so that every line processes only 2 different inputs:
=> the input can be muxed by a 2:1 multiplexer.
time 1 time 2 time 3 time 4 | time 1+4
a'b' a''b'' a'''b''' (ab)'''(cd)''' | a'b'
c'd' c''d'' c'''d''' (ef)'''(gh)''' | c'd'
e'f' e''f'' (ab)''(cd)'' e'''f''' | e'f'
g'h' g''h'' (ef)''(gh)'' g'''h''' | g'h'
(ab)'(cd)' (abcd)'(efgh)' (abcd)''(efgh)'' (abcd)'''(efgh)''' | (ab)'(cd)'
(ef)'(gh)' (abcdefgh)'i' (abcdefgh)''i'' (abcdefgh)'''i''' | (ef)'(gh)'
If I made no mistakes, this should be the operation network:
(clickable)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With