This section illustrates the operation of the perception processor
using a simple kernel that is mapped into microcode. The algorithm
to multiply two
floating point matrices is shown in
Figure 9.9. The control pattern consists of 3 level
nested for loops. Assuming that the matrices are stored in
row major order, the inner product computation will access array
along the row while
will be accessed along the column causing
a base stride access pattern. The compute pattern consists of multiply
accumulate operations, which form the core of the inner product function.
Figure 9.10 outlines a simple custom hardware accelerator
for this algorithm. Address generator
fetches the rows of matrix
. Address generator
generates the base stride pattern for
the columns of matrix
. Corresponding rows and columns are fetched
and applied to the floating point multiplier. The output of the multiplier
is accumulated in a scratch register by the floating point adder.
When an inner product sum is ready, it is written to a result SRAM,
which is not shown in the figure.
In theory, this simple pipeline could compute one inner product every
16 cycles. However, the final accumulation of the inner product value
creates a pipeline problem. The floating point add takes 7 cycles
and since the output is accumulated, a new product value can only
be handled every 7 cycles. Hence each inner product takes
cycles. Interleaving the computation of 7 or more inner products relieves
this bottleneck. However, this interleave complicates address generation.
The additional functionality required to fix this problem includes:
a) address generator
needs to be able to generate multiple interleaved
base-stride patterns b) address generator
needs to hold each
row element long enough for all the interleaved inner products and,
c) several scratch registers are required to hold the intermediate
sums. If the interleave factor is the same as the latency of the floating
point adder, no scratch registers are required. The output of the
adder may be fed back as an input and the intermediate sums will circulate
through the pipeline registers of the adder.
Compilers for high performance architectures attempt to approximate the dataflow in the custom accelerator. In vector processors, vector chaining creates a similar dataflow and reduction operators help alleviate some of the performance penalty caused by the floating point accumulate operation. By selecting independent adds and multiplies, which are ready for issue from its instruction window, an out of order processor will work somewhat like a vector processor that can be time sliced across several interleaved vectors. In addition, a combination of software pipelining and branch prediction ensures that the pipeline has as few wasted cycles as possible. Address generation will be handled by generic ALUs which send computed addresses to available load/store ports. Some form of register renaming will also be required to enable software pipelining to work well in nontrivial kernels.
Figure 9.11 shows the cleaned up perception processor
assembly code for the interleaved inner product. For brevity the outer
loops, which invoke the interleaved inner product, are not shown.
This code is capable of sustaining the same throughput (7 inner products
every
cycles) as the refined custom hardware accelerator.
Performance and energy efficiency are achieved by a combination of
techniques.
The inner product loop i_loop is marked for hardware modulo loop acceleration, and its parameters are configured into a free context in the loop unit. Two address contexts A_ri and B_ci are allocated and the address generators attached to the input SRAM ports are reconfigured. Both contexts are tied to the loop i_loop. B_ci is set to generate a column walk indexed by i_loop, with the starting offset specified in a constant field in the load opcode. A_ri is set to access the matrix row by row in conjunction with an outer loop. The address contexts effectively implement array variable renaming functions, a fact which is not evident in the code.
On entering i_loop the previous loop is pushed on a stack, though its counter value is still available for use by the address contexts, particularly A_ri. The new loop updates its counter every 7 cycles and admits new loop bodies into the pipeline. This is not a branch in a traditional sense and there is no branch penalty.
Communication is explicit and happens via load/store instructions
or via interfunction unit data transfers, both of which explicitly
address pipeline registers. In the example
and
are allocated to pipeline registers
and
respectively. In fact, it is more appropriate to say that
where
refers to the
interleaved inner product resides
in
at time
. No scratch registers are required
for the sum. The intermediate sums are merely circulated through the
long latency FPU adder. This notion of allocating variables both in
time and space is central to programming the perception processor.
The return value of each opcode mnemonic is the relative time at which
its result is available. The
pseudo op is a compile time directive
that controls the relative time step in which following instructions
are executed. Dataflow is arranged by referring to the producer of
a value and the time step it is produced in. Such a reference will
be translated by the compiler into commands for the forwarding logic.
More complex programs are written as several independent execution
streams. The streams are then made to rendezvous at a particular cycle
by adjusting the starting time of each stream. The example shows that
compile time pseudo ops can perform arithmetic on relative times to
ensure correct dataflow without the programmer needing to be aware
of the latencies of the actual hardware implementation.
The loop body for
will consist of 7 inner loop bodies
created by loop unrolling. Each inner loop body before unrolling takes
18 cycles to execute. Since
has been specified to have
an initiation interval of 7 cycles, a total of 3
bodies
corresponding to 21 of the original loop bodies will be in flight
within the cluster at a time. It is the modulo aware nature of the
address generators that permits each of these loop bodies to refer
to array variables in a generic manner like
and get the
reference that is appropriate for the value of
and
which
were current at the time that loop body was started. Without special
purpose address generation, such high levels of ILP will not be possible.
A previous version of the architecture without modulo address generators
had limited ILP because generic function units and registers were
used for address generation [67].
For this example, interleaving 7 inner products at a time results
in two left over columns. They are handled by a similar loop to the
one shown in Figure 9.11 except that it will have
more idle slots. The adder needs to be active all the time, but the
multiplier needs to work only 2 out of every 7 cycles. Since the multiplier
pipeline will not shift 5 out of 7 cycles, the dynamic energy consumption
resembles an ideal circuit where the adder runs at full frequency
and the multiplier runs at
of the frequency thereby consuming
less energy.
The overall effect is that the dataflow and throughput of the perception processor matches the custom hardware but in a more programmable manner. The address generators transfer data between the SRAMs and execution units in a distributed and autonomous manner similar to the custom accelerator in Figure 9.10. The output of the multiplier is directly forwarded to the input of the adder. As in the case of the accelerator, no scratch registers are used. The intermediate sums are circulated through the pipeline registers in the adder. All together, the microcode and the interconnect provide a level of programmability while retaining a level of hardware economy close to that of the ASIC.