The index expressions of array accesses in a multilevel nested loop depend on some subset of the loop variables. The purpose of the loop unit is to compute and maintain the loop variables required for address generation in the memory system while the loop body itself is executed in the function units. Figure 9.6 shows a simplified organization of the loop unit. The loop unit provides hardware support for modulo scheduling, a software pipelining technique that delivers high levels of loop performance in VLIW architectures [81].
A brief introduction to some modulo scheduling terminology is necessary to understand the functioning of the loop unit. Assume a loop body that takes N cycles to execute. Modulo scheduling allows the execution of a new instance of this loop body to start every II (Initiation Interval) cycles, where II is less than N. A normal loop that is not modulo scheduled may be considered a modulo scheduled loop with II = N. How II is determined and the conditions that must be satisfied by the loop body are described in [81]. The original loop body may be converted to a modulo scheduled loop body by replicating instructions such that every instruction that was originally scheduled in cycle n also appears in all possible cycles (n + i × II) mod N, where i is an integer. This has the effect of pasting a new copy of the loop body at intervals of II cycles over the original loop body and wrapping around all instructions that appear after cycle N. If a particular instruction is scheduled for cycle n, then ⌊n / II⌋ is called its modulo period.
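The replication rule can be made concrete with a small sketch. It assumes that an instruction originally scheduled in cycle n of an N-cycle loop body is replicated into the cycles (n + i × II) mod N, and that its modulo period is the integer quotient of n by II; both are readings of the description above, not formulas quoted from [81]. The function names are hypothetical and chosen only for illustration.

```python
def replicated_cycles(n, ii, body_len):
    """Cycles in which an instruction originally scheduled in cycle n appears
    after copies of the loop body are pasted every ii cycles and wrapped
    around at body_len (assumed rule: (n + i * ii) mod body_len)."""
    # The sequence repeats after at most body_len steps, so this range
    # is enough to enumerate every distinct cycle.
    return sorted({(n + i * ii) % body_len for i in range(body_len)})


def modulo_period(n, ii):
    """Assumed definition: number of whole initiation intervals before cycle n."""
    return n // ii


# A 6-cycle loop body with a new iteration started every 2 cycles.
print(replicated_cycles(1, 2, 6))   # cycles holding a copy of the instruction
print(modulo_period(5, 2))          # modulo period of an instruction in cycle 5
```

With II equal to N the instruction appears only in its original cycle, which matches the view of an ordinary loop as a modulo scheduled loop whose initiation interval is the full body length.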
The compiler configures static parameters, including II and the loop count limits, into loop context registers. The corresponding dynamic values of the loop variables are held in the loop counter register file. The only other piece of information required is which loop body the program counter currently points to. A four-entry loop stack captures this information. In this implementation, the loop unit can keep track of four levels of loop nest at a time, which is sufficient for the benchmarks used in this research. For larger loop nests, the address expressions that depend on additional outer loops may be computed in software, as in a traditional processor. A four-entry loop context register file holds the encoded start and end counts and the increment of up to four innermost for loops. Loops are a resource that can be allocated and managed just as one would allocate memory on a traditional architecture. The loop unit maintains a counter for each loop nest level and updates it periodically. It also modifies the program counter and, in the case of modulo loops, admits new loop bodies into the pipeline. In that case it also performs additional manipulation of the loop counter to drain the pipeline correctly on loop termination. On entering a new loop, any previous loop is pushed onto a stack, though its counter value remains available for use by address generators. Loop parameters may be loaded from memory, which permits modulo scheduling of loops whose loop counts are not known at compile time. Appropriate loop parameters may be loaded from SRAM at run time depending on the size of the input data.
Just before starting a loop-intensive section of code, loop parameters (perhaps dynamically computed) are written into the context registers using write_context instructions. On entry into each loop body, a push_loop instruction pushes the index of the context register for that loop onto the stack. At any given moment, the top of the stack represents the innermost loop being executed. An II counter repeatedly counts up to the initiation interval and then resets itself. Every II cycles, the loop increment is automatically added to the loop variable held in the loop counter register file, so no loop increment instructions are required. When the end count of the loop is reached, the innermost loop has completed. The top entry is automatically popped off the stack, and the process is repeated for the enclosing loop. Note from Figure 9.6 that the registers and datapaths have small widths of 4 and 9 bits, which cover most common loops. These widths are parameters specified in the perception processor configuration, and the netlist generator tool can generate perception processors that use any user-specified widths. The choices in Figure 9.6 were sufficient to cover the benchmarks used in this research. Loops that are incompatible with a particular perception processor configuration can always be executed in software, so the reduced bit-widths save energy in the common case.
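The sequence of events described above can be illustrated with a small behavioral model. This is a simplified sketch, not the hardware design: the class and field names are hypothetical, the end count is assumed to be exclusive with a positive increment, register widths are ignored, and the extra counter manipulation used to drain the pipeline of a modulo loop is not modeled. Only the names write_context and push_loop come from the text.

```python
from dataclasses import dataclass


@dataclass
class LoopContext:
    start: int   # initial value of the loop variable
    end: int     # loop completes when the variable reaches this value (assumed exclusive)
    step: int    # loop increment (assumed positive)
    ii: int      # initiation interval in cycles


class LoopUnit:
    MAX_DEPTH = 4  # four-entry context file, counter file and loop stack

    def __init__(self):
        self.contexts = [None] * self.MAX_DEPTH  # loop context registers
        self.counters = [0] * self.MAX_DEPTH     # loop counter register file
        self.stack = []                          # indices of active loops, innermost last
        self.ii_count = 0                        # II counter

    def write_context(self, index, ctx):
        self.contexts[index] = ctx

    def push_loop(self, index):
        if len(self.stack) == self.MAX_DEPTH:
            raise OverflowError("loop nest deeper than the loop stack")
        self.stack.append(index)
        self.counters[index] = self.contexts[index].start
        self.ii_count = 0

    def tick(self):
        """Advance one cycle; return True if the innermost loop completed."""
        if not self.stack:
            return False
        top = self.stack[-1]
        ctx = self.contexts[top]
        self.ii_count += 1
        if self.ii_count < ctx.ii:
            return False
        self.ii_count = 0                # II counter wraps
        self.counters[top] += ctx.step   # automatic loop increment
        if self.counters[top] >= ctx.end:
            self.stack.pop()             # innermost loop is done
            return True
        return False


# One loop of three iterations with a new iteration admitted every 2 cycles.
unit = LoopUnit()
unit.write_context(0, LoopContext(start=0, end=3, step=1, ii=2))
unit.push_loop(0)
trace = []
while unit.stack:
    trace.append(unit.counters[0])  # value seen by the address generators this cycle
    unit.tick()
print(trace)
```

In this model the counter of an enclosing loop stays readable in the counter file while an inner loop is on top of the stack, which mirrors the requirement that address generators can use outer loop variables.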