Perception applications are stream oriented with a large number of
2D array and vector accesses per elementary operation. These accesses
typically occur within tight loops with known bounds. Traditional
processors have a limited number of load/store ports, and this limits
overall performance because of the high number of array accesses,
which is the reason DSPs traditionally partition their memory resources.
A large number of SRAM ports are required to efficiently feed data
to function units. Increasing the number of ports on a single SRAM
or cache increases access time and power consumption. This motivates
the choice of multiple small software managed scratch SRAMs. It is
also possible to power down SRAMs that are not required. For low-leakage
processes, a large fraction of the energy consumption is in the sense
amplifiers of the SRAM ports. They consume approximately 50% of the
processor energy in the 0.25 µm implementation. Mechanisms to efficiently
use these expensive resources are important for both performance and
energy conservation.
Hardware performance counter measurements on a MIPS R14K processor showed that 32.5% (geometric mean) of the executed instructions were loads/stores for the set of perception benchmarks described later in Section 10.1. The high rate of load/store operations, combined with the regular array access patterns, makes it possible to overlap computation and SRAM access using hardware accelerators. A large fraction of the remaining 67.5% of executed instructions is array address calculation that supports load/store operations. Significant optimizations are possible by associating each SRAM port with an address generator that handles the common access patterns of streaming applications. These patterns include 2D array and vector accesses in modulo scheduled or software pipelined loops. Details may be found in Section 9.6.4.
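To make the address calculation overhead concrete, the following is a minimal sketch in C of the arithmetic that a per-port address generator could take over for a 2D array access. The structure `access_pattern_t`, its fields, and the functions `agen_address` and `sum_window` are hypothetical names chosen for illustration, and the base-plus-strides formula is an assumed, generic form of a 2D access pattern rather than the actual hardware design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a 2D access pattern (illustrative only). */
typedef struct {
    uint32_t base;        /* word address of element [0][0] in the SRAM */
    uint32_t row_stride;  /* words between consecutive rows */
    uint32_t col_stride;  /* words between consecutive columns */
} access_pattern_t;

/* Address an (assumed) generator would produce for loop indices (i, j).
 * On a conventional processor this multiply/add sequence is executed as
 * ordinary instructions for every array access. */
static uint32_t agen_address(const access_pattern_t *p, uint32_t i, uint32_t j)
{
    return p->base + i * p->row_stride + j * p->col_stride;
}

/* Sum a rows x cols window of an SRAM image, fetching every operand
 * through the modelled address generator. */
static int32_t sum_window(const int32_t *sram, size_t sram_words,
                          const access_pattern_t *p,
                          uint32_t rows, uint32_t cols)
{
    int32_t sum = 0;
    for (uint32_t i = 0; i < rows; i++) {
        for (uint32_t j = 0; j < cols; j++) {
            uint32_t addr = agen_address(p, i, j);
            assert(addr < sram_words);   /* stay inside the scratch SRAM */
            sum += sram[addr];
        }
    }
    return sum;
}
```

In this model the loop body contains only the accumulate; all index arithmetic lives in `agen_address`, which is the part that dedicated hardware next to each SRAM port could compute in parallel with the function units.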
Four new instructions are required to take advantage of the optimizations:
:
Reconfigure an address generator by transferring a description of
an access pattern into a context register within the memory system.
When applied to the loop unit, this instruction similarly transfers
the parameters of a loop into a loop context register.
and
:
These are loads/stores that use the address generation mechanism.
The
encoded into the immediate constant field of
the instruction specifies the address generator to be used and the
index of a context register within it.
:
Let the memory system know that a new loop is starting.
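
The interaction of these instructions with the memory system can be sketched as a small software model in C. Everything named here is an assumption made for illustration: the function names (`ms_configure`, `ms_load`, `ms_store`, `ms_loop_start`), the strided `pattern_t` fields, the split of the immediate field into generator and context bits, and the single iteration counter are hypothetical and do not reflect the actual instruction mnemonics or encoding.

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_AGENS = 4, NUM_CONTEXTS = 4, SRAM_WORDS = 64 };

/* Hypothetical vector access pattern held in one context register. */
typedef struct { uint32_t base, stride; } pattern_t;

typedef struct {
    pattern_t ctx[NUM_AGENS][NUM_CONTEXTS]; /* context registers per generator */
    uint32_t  iter;                         /* current loop iteration */
    int32_t   sram[SRAM_WORDS];
} memsys_t;

/* Assumed immediate layout: generator in bits [3:2], context in bits [1:0]. */
static uint8_t make_imm(unsigned agen, unsigned ctx)
{
    assert(agen < NUM_AGENS && ctx < NUM_CONTEXTS);
    return (uint8_t)((agen << 2) | ctx);
}

/* Reconfigure: transfer a pattern description into a context register. */
static void ms_configure(memsys_t *m, uint8_t imm, pattern_t p)
{
    m->ctx[(imm >> 2) & 3][imm & 3] = p;
}

/* Tell the memory system that a new loop is starting. */
static void ms_loop_start(memsys_t *m) { m->iter = 0; }

/* In hardware the loop unit would advance this; here it is explicit. */
static void ms_next_iteration(memsys_t *m) { m->iter++; }

/* Address derived from the selected context and the loop iteration. */
static uint32_t ms_address(const memsys_t *m, uint8_t imm)
{
    const pattern_t *p = &m->ctx[(imm >> 2) & 3][imm & 3];
    uint32_t addr = p->base + m->iter * p->stride;
    assert(addr < SRAM_WORDS);
    return addr;
}

/* Load/store that use the address generation mechanism. */
static int32_t ms_load(const memsys_t *m, uint8_t imm)
{
    return m->sram[ms_address(m, imm)];
}

static void ms_store(memsys_t *m, uint8_t imm, int32_t v)
{
    m->sram[ms_address(m, imm)] = v;
}

/* Usage: dst[i] = k * src[i] with no explicit address arithmetic. */
static void scale_vector(memsys_t *m, uint8_t src, uint8_t dst,
                         uint32_t n, int32_t k)
{
    ms_loop_start(m);
    for (uint32_t i = 0; i < n; i++) {
        ms_store(m, dst, k * ms_load(m, src));
        ms_next_iteration(m);
    }
}
```

The loop body in `scale_vector` names only an immediate for each operand; the pattern is set up once before the loop, which mirrors the idea of moving address calculation out of the instruction stream.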