Most perception algorithms have a high ratio of array variable accesses to operators. Multiple SRAM ports are essential for high throughput. The three dual ported SRAMs in Figure 9.1 together have a read/write power consumption approximately equal to the total function unit power consumption. Since each additional SRAM port introduces area and energy overhead, utilizing them effectively is essential for performance. A previous version of the architecture which used generic integer ALUs for address generation was unable to maximize SRAM port utilization [67]. This is because generating an address for a 2D array access involves multiply/shift and add operations which incurs multiple cycles of latency in a traditional processor. When a tight loop body involves several array accesses a significant fraction of the function units and registers will need to be allocated for address calculation rather than computing results. Since address calculation for arrays is a stylized operation, it is possible to design distributed semiautonomous address generation hardware that frees up function unit resources for the actual result computation and improves data delivery and throughput. In the perception processor, dedicated address generators are attached to each SRAM port. They handle commonly occurring address sequences like vector and strided access as well as 2D array accesses including row and column walks. They can handle address generation under regular, modulo and unrolled loops and can deal with special situations that occur when multiple loop bodies are in flight simultaneously.
Before entering into a loop intensive section of code, the compiler uses write_context instructions to write descriptions of array access patterns into the address context register files of address generators. For increased throughput the same access pattern may be written into multiple address generators. Each address context includes the row and element sizes, the base address as well as the loop counter indices that correspond to the array's loop variables. The loop counter indices may be used to retrieve the value of loop count variables generated by the loop unit in Figure 9.6. In the current implementation there are four context entries in each address generator corresponding to a total of 24 access patterns simultaneously. Since write_context is a single cycle operation, dynamic reconfiguration has very low overhead. The parameters for an array access pattern are packed into a single 32-bit word with the base address at the least significant bit. So arithmetic can be done on the packed word to update the base address dynamically.
Address computation for array and vector references issued to an SRAM port are handled by its attached stream address generator. The operation of the address generator depends on loop counters from the loop unit and array parameters like base address and row size that are stored in its address context register file. Figure 9.7 shows the internal structure of an address generator. To understand how this simple structure can accomplish a variety of address calculations, it is essential to understand how a compiler generates addresses for array references. Consider the 2D arrays declared and used in C as shown in Figure 9.8.
To simplify the discussion, assume word oriented addressing. Let the
size of the Complex
struct be denoted as elem_ size
. Then, the size of one row of A
is
. If the offset of imag within
the struct is 1 and the base address of
is
, then
the base addresses of the imag field will be
.
So the address expressions corresponding to the load into
is
since C stores
arrays in row major order. A vector is a single-dimensional array,
so its address expression is just a special case where
.
For more complex index expressions of the form
, the
factors
and
may be absorbed into the row size and base address
respectively. A column-walk of the form
can be evaluated
similarly. By constraining the row and element sizes to be powers
of two, the address expression reduces to the form
.
For cases where row size cannot be a power of two, to help pack more
data into the scratch memory, row size may be picked as the sum of
two powers of two and separate expressions may be used to access the
bulk of the array and the residue. For arrays with
dimensions,
the base address is repeatedly recalculated to account for
dimensions and the last two levels of loop nest are left to the hardware
to deal with. Not all array accesses need to use the same loop variables.
In the example, the access of
depends on
unlike
which
depends on
. The address generator is capable of picking the
correct loop variables and plugging them into the address expression.
Each address generator has a designated partner ALU in the cluster
with several address generators possibly sharing the same partner.
In cases where the address generator is not equipped to compute the
array index function, it is possible to directly issue an address
computed by its partner ALU. The partner ALU can also compute address
contexts on the fly and reconfigure an address generator. The combination
of an address generator and its partner ALU can effectively deal with
indirect access streams of the type
. Address generation
adds 1 cycle latency to load/store operations.
When the compiler emits code for an array access, the index of an address generator and the index of an address context register within that generator are encoded into the context index field of the load/store instruction. The selected address generator then uses the context index field to retrieve the array parameters from the context register file as shown in Figure 9.7. The retrieved context entry specifies the loop variables to be used for calculating the address. The muxes at the top right of the figure use this information to select the appropriate loop counters. The mux inputs are connected to the loop count registers of the loop unit shown in Figure 9.6. The shifters then shift the selected loop variables, and the result is OR-ed and added to the base address to generate an address. To improve the processor's cycle time, pipeline registers have been inserted just before the final add operation.
Several special cases are handled in the address generator. It is common to unroll loops by a small factor and software pipeline them for performance. In that case, instead of using two loop variables, it is possible to use one loop variable and one unroll factor to compute the address. The unroll factor is packed into the immediate field of the instruction and selected in lieu of the loop variable using the upper 2-to-1 mux in Figure 9.7. When the access pattern is too complex to be handled by the address generator, the lower 2-to-1 mux selects an address that is computed by an ALU. To handle vectors and ALU generated addresses with one or zero loop variables respectively, the loop unit has a special loop counter which is always zero.