Setting the modulo period field in load.context/store.context instructions to a nonzero value unlocks a performance-enhancing feature called Array Variable Renaming. Modulo scheduling makes it possible to overlap the execution of multiple instances of the inner loop body. Assume that the k loop from Figure 9.8 has a latency of 30 cycles and that, after resolving resource conflicts and satisfying data dependences, it is possible to start a new copy of the loop body every 5 cycles. Then up to 6 copies of the loop body could be in flight through the execution pipeline. To get data dependences correct for new loop bodies, the loop variable should be incremented every 5 cycles. However, once it is incremented, older instances of the loop body that are still in flight will see the wrong value and violate dependences for load/store instructions that occur close to the end of the loop body.
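The hazard described above can be illustrated with a small model. This is only a sketch under stated assumptions: the loop variable is assumed to start at 0 and to be incremented once per initiation interval, and the function names are hypothetical rather than part of the processor.

```python
# Illustrative model only; the names below are hypothetical, not processor state.
LATENCY = 30  # cycles for one loop body (figure used in the text)
II = 5        # initiation interval: a new loop body starts every 5 cycles


def max_in_flight(latency: int, ii: int) -> int:
    """Number of loop bodies that can overlap (ceiling of latency / ii)."""
    return -(-latency // ii)


def shared_loop_var(cycle: int, ii: int) -> int:
    """Assumed single loop variable: starts at 0, incremented once every ii cycles."""
    return cycle // ii


def observed_index(iteration: int, offset: int, ii: int) -> int:
    """Loop variable value seen by an operation issued `offset` cycles
    after the given iteration started."""
    return shared_loop_var(iteration * ii + offset, ii)


if __name__ == "__main__":
    print(max_in_flight(LATENCY, II))   # overlapping copies of the loop body
    print(observed_index(0, 2, II))     # early load in iteration 0
    print(observed_index(0, 28, II))    # late store in iteration 0
```

In this model, a store issued 28 cycles into iteration 0 observes the value 5 instead of 0, because the shared loop variable has been incremented five times on behalf of younger loop bodies.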
The traditional solution is to use multiple copies of the loop variable
in conjunction with the VLIW equivalent of register renaming, a
rotating register file. Multiple address calculations are performed,
the appropriate values loaded into the register file and the register
file is rotated. For long latency loop bodies with short initiation
intervals, this leads to increased register pressure. The solution
to this problem is to increment a single copy of the loop variable
every initiation interval and compensate for the increment in older
copies of the loop body which are in flight. The compensation factor,
which is really the modulo period, is encoded into the immediate field
of load/store instructions. It is subtracted from the loop variable's
value so that dependences resolve correctly. In effect, this
rotates the array variable and lets a generic array reference
expression be rebound to separate addresses. Array variable
renaming effectively converts the entire scratch pad memory
into a rotating register file with separate virtual rotating registers
for each array accessed in a loop. Array variable renaming is much
more powerful than register rotation, but it can also be used in conjunction
with a rotating register file. This could be useful in cases in which
it is possible to custom design rotating register files that have
lower latency than the SRAM and address generator combination used
to implement array renaming. Such a combination of array renaming
and register rotation can capitalize on the flexibility provided by
array renaming and the low latency provided by a custom designed rotating
register file. The perception processor does not have an architected
register file at all; it merely uses array variable renaming in
place of register renaming to achieve very high throughput at
low power.
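The compensation scheme can be sketched in the same style. The model below is an assumption, not the processor's address generator: it takes the modulo period of a load/store to be the number of loop variable increments that occur between the start of its loop body and the cycle in which it issues, and it subtracts that immediate from a single shared loop variable. All names are hypothetical.

```python
# Illustrative model only; names and the exact encoding are assumptions.
II = 5  # initiation interval in cycles


def modulo_period(offset: int, ii: int = II) -> int:
    """Assumed compile-time immediate for an operation scheduled `offset`
    cycles into the loop body: increments elapsed since the body started."""
    return offset // ii


def effective_index(cycle: int, period: int, ii: int = II) -> int:
    """Shared loop variable at `cycle`, minus the operation's modulo period."""
    return cycle // ii - period


def simulate(num_iterations: int, offsets: list, ii: int = II) -> dict:
    """Map (iteration, offset) to the index each load/store resolves to."""
    results = {}
    for iteration in range(num_iterations):
        start = iteration * ii  # cycle in which this loop body begins
        for offset in offsets:
            period = modulo_period(offset, ii)
            results[(iteration, offset)] = effective_index(start + offset, period, ii)
    return results


if __name__ == "__main__":
    for key, index in sorted(simulate(3, [0, 12, 28]).items()):
        print(key, index)
```

Under these assumptions every operation resolves to the index of its own iteration, regardless of how many younger loop bodies have started since, while only one copy of the loop variable is kept.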