Figure 6.2 shows the architecture of the accelerator. The datapath
consists of a floating point unit that computes each term of the sum,
followed by an adder that accumulates the sum and a fused multiply-add
unit that performs the final scaling. Given that X, Mean, and Var are
39-element vectors, a vector-style architecture suggests itself. The
difficulty lies in the accumulation step: each addition depends on the
sum from the previous cycle, and floating point adders have multicycle
latencies.
For a vector length of N and an addition latency of M, a straightforward
implementation takes roughly N × M cycles. Binary tree reduction
(similar to an optimal merge algorithm) is possible, but even then
the whole loop cannot be pipelined with unit initiation interval.
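
The cost of the dependent accumulation can be illustrated with a small cycle-count model. This is only a sketch: the adder latency M = 4 is an assumed value, the function names are hypothetical, and the tree figure counts only the depth of the dependent add chain (it assumes all terms are available up front and that enough adder slots exist at every level).

```python
def serial_sum_cycles(n: int, m: int) -> int:
    """Cycles to accumulate n terms one after another when every add
    needs the previous sum and the adder has an m-cycle latency."""
    ready = 0                  # cycle at which the running sum is available
    for _ in range(n):
        ready += m             # the next add cannot issue before the sum is ready
    return ready


def tree_depth_cycles(n: int, m: int) -> int:
    """Latency of the dependent chain in a binary tree reduction of n terms:
    ceil(log2(n)) levels of adds, each level costing m cycles."""
    levels = (n - 1).bit_length()   # integer ceil(log2(n)) for n >= 1
    return levels * m


N, M = 39, 4                   # N is from the text; M = 4 is an assumption
print(serial_sum_cycles(N, M))  # 156
print(tree_depth_cycles(N, M))  # 24
```

Even in the tree case a chain of dependent additions remains, which is consistent with the observation that the loop as a whole still cannot start a new iteration every cycle.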
This problem is solved by reordering Loops 1, 2, 3 into a 2, 3, 1
order. This calculates an (X - Mean)^2 × Var term for each input
block while reading out the mean and variance values just once from
the SRAM. Effectively this is an interleaved execution of 10 separate
vectors on a single function unit, which leaves enough time to do
a floating point addition of a partial sum term before the next term
arrives for that vector. The cost is 10 internal registers to maintain
partial sums. Loops 2,3,1 can now be pipelined with unit initiation
interval. In the original algorithm, the Mean/Var SRAM is accessed
every cycle whereas with the loop interchanged version this 64-bit
wide SRAM is accessed only once every 10 cycles. Since SRAM read current
is comparable to function unit current in the CMOS technology used
for this design, the loop interchange also contributes significant
savings in power consumption.
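
The effect of the loop interchange on Mean/Var reads can be sketched in software. The following is an illustration under stated assumptions, not the accelerator's actual control logic: Loop 1 is taken to iterate over the 10 input blocks, Loop 2 over Gaussians (8 is an assumed count), and Loop 3 over the 39 vector elements; the per-element term is assumed to be (x - mean)^2 * var; the function names and test data are made up.

```python
def gaussian_sums_original(x_blocks, mean, var):
    """Loop order 1,2,3: input block, Gaussian, vector element."""
    sram_reads = 0
    sums = [[0.0] * len(mean) for _ in x_blocks]
    for b, x in enumerate(x_blocks):              # Loop 1: input blocks
        for g in range(len(mean)):                # Loop 2: Gaussians
            for i in range(len(x)):               # Loop 3: vector elements
                m, v = mean[g][i], var[g][i]      # one Mean/Var read per term
                sram_reads += 1
                d = x[i] - m
                sums[b][g] += d * d * v
    return sums, sram_reads


def gaussian_sums_interchanged(x_blocks, mean, var):
    """Loop order 2,3,1: each Mean/Var pair is read once and reused for
    every input block; sums[b][g] stands in for the partial-sum registers."""
    sram_reads = 0
    sums = [[0.0] * len(mean) for _ in x_blocks]
    for g in range(len(mean)):                    # Loop 2
        for i in range(len(mean[g])):             # Loop 3
            m, v = mean[g][i], var[g][i]          # single read shared by all blocks
            sram_reads += 1
            for b, x in enumerate(x_blocks):      # Loop 1, now innermost
                d = x[i] - m
                sums[b][g] += d * d * v
    return sums, sram_reads


# Made-up data: 10 input blocks, 8 Gaussians (assumed), 39 elements per vector.
BLOCKS, GAUSSIANS, N = 10, 8, 39
x_blocks = [[0.1 * b + 0.01 * i for i in range(N)] for b in range(BLOCKS)]
mean = [[0.05 * g + 0.02 * i for i in range(N)] for g in range(GAUSSIANS)]
var = [[1.0 + 0.1 * g + 0.01 * i for i in range(N)] for g in range(GAUSSIANS)]

sums_a, reads_a = gaussian_sums_original(x_blocks, mean, var)
sums_b, reads_b = gaussian_sums_interchanged(x_blocks, mean, var)
print(sums_a == sums_b)     # each sum adds its terms in the same order
print(reads_a // reads_b)   # reads drop by the number of interleaved blocks
```

In the interchanged version, consecutive additions belong to different blocks, so an addition for a given block is followed by nine unrelated additions before that block's partial sum is needed again.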
The Final Sigma unit in Figure 6.2
works in a similar manner, except that instead of a floating point
adder, it uses a fused multiply add unit. It scales the sum and adds
the final weight. This unit has fairly low utilization since it
receives only 10 inputs (one completed sum per interleaved vector)
every 390 cycles (10 vectors of 39 elements each).
To save power this unit is disabled when it is idle. In a multichannel
configuration it is possible to share this unit between multiple channels.
To reduce the number of reads the processor needs to perform to fetch
results from the accelerator, this unit may be made to accumulate
the final score. This also serves to reduce the outgoing bandwidth
from the processor by a factor of eight. In that case, due to the
interleaved execution this unit also requires 10 intermediate sum
registers. Log domain addition can be implemented using an integer
subtract, table lookup and an integer add operation. The state machine
needs to be adapted to recirculate the results through the integer
add/subtract unit within the floating point adder. The lookup table
used for extrapolation is constant and can therefore be implemented
as optimized logic within the state machine. In this design, log domain
addition is implemented in software.
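
The subtract, lookup, add sequence for log domain addition can be sketched as follows. This is a minimal software illustration, not the design's actual routine: it assumes log values are stored as integers in a base close to 1, and the base 1.001, the helper names, and the table construction are all assumptions chosen to keep the example small.

```python
import math

LOG_BASE = 1.001   # assumed base; chosen only to keep the table small


def to_log(p: float) -> int:
    """Quantize a probability to an integer log in base LOG_BASE."""
    return int(round(math.log(p) / math.log(LOG_BASE)))


def build_table(base: float) -> list:
    """table[d] = round(log_base(1 + base**(-d))), truncated at the first
    entry that rounds to zero (entries decrease monotonically with d)."""
    table = []
    d = 0
    while True:
        value = int(round(math.log(1.0 + base ** (-d)) / math.log(base)))
        if value == 0:
            break
        table.append(value)
        d += 1
    return table


TABLE = build_table(LOG_BASE)


def log_add(a: int, b: int) -> int:
    """Return the integer log of (base**a + base**b)."""
    if a < b:
        a, b = b, a
    d = a - b                  # integer subtract
    if d >= len(TABLE):
        return a               # the smaller operand no longer changes the result
    return a + TABLE[d]        # table lookup, then integer add


result = log_add(to_log(0.3), to_log(0.2))
print(result, to_log(0.5))     # the two values should agree to within rounding
```

Because the table depends only on the base, it is a constant, which is what makes an implementation as fixed logic plausible.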