Is it possible to reduce the space requirement of a tree of binary operations on an FPGA at the expense of bandwidth by a factor of less than 2?

Question

I have a circuit where at each clock cycle, N 32-bit inputs are present to be computed on. I have a binary operation that takes two 32-bit inputs and yields one 32-bit output. This operation is associative and I would like to apply it to the entire N 32-bit inputs in order to yield one 32-bit output. Currently, I am achieving this by implementing a pipelined binary tree of operations.

For a concrete example, supposing N = 4 and I have inputs {a,b,c,d} then I would do the following:

a op b => reg1
c op d => reg2
reg1 op reg2 => result

When a stage in the tree is not divisible by 2, I insert a dummy operation that takes only 1 input an yields 1 output with the same latency.

The issue that I have is that I am concerned with a few sizes of N inputs {9, 25, 49, 81, 121}. The largest size of N, 121, requires 110% of luts in my FPGA fabric while all other sizes fit easily. The tree of these binary operations is by far the largest consumer of luts in my design.

I am aware that I could reduce the utilization of luts by nearly half by halving the number of op circuits residing on my board and multiplexing their inputs. Unfortunately, this would mean only getting a result every other clock cycle and halving the bandwidth.

Because a full tree requires only ~10% more resources than the board offers, a 50% reduction in bandwidth seems like too significant of a hit. Is there an architecture where a I can make the trade off of a more fine grained reduction in size for a fine grained reduction in bandwidth?

Lincoln · Accepted Answer

Have you considered using 3, 4, 5, or 6 input operators? LUTs in current Altera and Xilinx FPGAs are 6 input. If your operators are really simple ( ie. AND, OR, XOR) you may be wasting a lot of LUTs in your current implementation by only using a 2 input function instead of a 6 input function ( assuming that you register each stage of the pipeline). This would turn your binary tree into a k-ary tree of operations. With your numbers would waste many LUTs depending on the choice of N.

Here are the choices of K that would make sense:

 N   | K | LUTs/bit | LUTs | LUTs/bit @k=2 | LUTs @k=2
-----|---|----------|------|---------------|-----------
 9   | 3 |     4    |  128 |      11       |    352
 25  | 5 |     6    |  192 |      25       |    800
 49  | 5 |    13    |  416 |      50       |   1600
 81  | 6 |    18    |  576 |      75       |   2400
121  | 6 |    26    |  832 |     123       |   3936

Paebbels · Answer

Ok, I took your example with 9 inputs and tried to solve your binary tree problem with 75% of the binary operators.

I name the inputs a,b,c,d,e,f,g,h,i. To distinguish the input values in time, I'll add a tick after a.

a'    = a at time 1
a'''  = a at time 3
(ab)' = result of a' and b'

9 inputs require 8 binary operations. I restrict my processing scheme to 6 operators, so it uses 75% of the needed operators. Each line in the following schema represents one operator.

time 1      time 2          time 3            time 4              | time 1+4
a'b'        (abcd)'(efgh)'  (ab)''(cd)''      e'''f'''            | a'b'
c'd'        (abcdefgh)'i'   (ef)''(gh)''      g'''h'''            | c'd'
e'f'        a''b''          (abcd)''(efgh)''  (ab)'''(cd)'''      | e'f' 
g'h'        c''d''          (abcdefgh)''i''   (ef)'''(gh)'''      | g'h'
(ab)'(cd)'  e''f''          a'''b'''          (abcd)'''(efgh)'''  | (ab)'(cd)'
(ef)'(gh)'  g''h''          c'''d'''          (abcdefgh)'''i'''   | (ef)'(gh)'

After 4 cycles the schema repeats. So the processing of 3 input sets requires 4 cycles:
=> 33 % overhead.

This schema can be sorted, so that every line processes only 2 different inputs:
=> the input can be muxed by a 2:1 multiplexer.

time 1      time 2          time 3            time 4              | time 1+4
a'b'        a''b''          a'''b'''          (ab)'''(cd)'''      | a'b'
c'd'        c''d''          c'''d'''          (ef)'''(gh)'''      | c'd'
e'f'        e''f''          (ab)''(cd)''      e'''f'''            | e'f' 
g'h'        g''h''          (ef)''(gh)''      g'''h'''            | g'h'
(ab)'(cd)'  (abcd)'(efgh)'  (abcd)''(efgh)''  (abcd)'''(efgh)'''  | (ab)'(cd)'
(ef)'(gh)'  (abcdefgh)'i'   (abcdefgh)''i''   (abcdefgh)'''i'''   | (ef)'(gh)'

If I made no mistakes, this should be the operation network:

schema
(clickable)

Is it possible to reduce the space requirement of a tree of binary operations on an FPGA at the expense of bandwidth by a factor of less than 2?

Tags:

architecture

tree

fpga

user3704133

2 Answers

Lincoln

Paebbels

Recent Activity

Donate For Us

Is it possible to reduce the space requirement of a tree of binary operations on an FPGA at the expense of bandwidth by a factor of less than 2?

Tags:

architecture

tree

fpga

user3704133

2 Answers

Lincoln

Paebbels

Related questions

Recent Activity

Donate For Us