9. Perception Processor Architecture
Chapter 3 explained that achieving high IPC is critical to realizing high-performance, low-power perception processors. Chapters 4 and 7 described the structure of typical perception algorithms, which are characterized by simple multilevel nested loops where the majority of arithmetic and floating point operators have array and vector operands. Operand availability is therefore critical to achieving high IPC. It was also seen that perception applications may be expressed as a pipeline of algorithms. These facts motivate the choice of architectures that embody function unit clusters for high ILP and simple communication mechanisms that permit chaining multiple processors to implement a pipeline of algorithms. The ideal perception processor is general enough to execute multiple algorithms, yet small enough to conserve energy and die area. An empirical search for a processor architecture that satisfies the generality, high IPC, and low resource utilization criteria led to an initial architecture that was successively refined. The end result of this evolutionary process is depicted in Figure 9.1.
The perception processor architecture consists of a set of clock gated function units, a loop unit, three dual ported SRAMs, six address generators (one for each SRAM port), local bypass paths between neighboring function units as well as a cluster wide interconnect. A register file is conspicuously absent because the combination of compiler controlled dataflow and a technique called array variable renaming makes a register file unnecessary. Though none of the clusters described here need a register file, it is possible to incorporate one into a function unit slot. Clusters can be configured to maximize the performance of any particular application or set of applications. Typically there will be a minimum number of integer ALUs as well as additional units that are more specialized. Hardware descriptions for the cluster and the interconnect are automatically generated by a cluster generator tool from a configuration description. Details may be found in Section 9.8.
To understand the rationale behind this organization, it is important to know that typical stream oriented loop kernels found in perception algorithms may be split into three components: control patterns, access patterns and compute patterns. The control pattern is typically a set of nested for loops. Access patterns seen in these algorithms are row and column walks of 2D arrays, vector accesses and more complex patterns produced when simple array accesses are interleaved or software pipelined. Compute patterns correspond to the dataflow between operators within the loop body. For example, the compute pattern of a vector dot product is a multiply-accumulate flow where a multiplier and an adder are cascaded and the adder's output is fed back as one of its inputs.
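The three-way split can be made concrete with a small software sketch. The Python below is purely illustrative and is not the processor's programming model; the function names (`row_walk`, `column_walk`, `mac`, `dot`, `matmul_element`) are hypothetical. It computes one element of a matrix product by pairing a row walk with a column walk (access patterns), feeding both streams through a multiply-accumulate (compute pattern), under a single loop (control pattern).

```python
def row_walk(row, n_cols):
    """Access pattern: flat addresses of one row of a row-major 2D array."""
    return [row * n_cols + c for c in range(n_cols)]


def column_walk(col, n_rows, n_cols):
    """Access pattern: flat addresses of one column (stride of n_cols)."""
    return [r * n_cols + col for r in range(n_rows)]


def mac(acc, a, b):
    """Compute pattern: a multiplier cascaded into an adder whose
    output is fed back as one of the adder's inputs."""
    return acc + a * b


def dot(mem_a, addrs_a, mem_b, addrs_b):
    """Control pattern: one loop that steps both address streams together."""
    acc = 0
    for addr_a, addr_b in zip(addrs_a, addrs_b):
        acc = mac(acc, mem_a[addr_a], mem_b[addr_b])
    return acc


def matmul_element(a, b, i, j, n):
    """Element (i, j) of an n x n product: row i of A dotted with column j of B."""
    return dot(a, row_walk(i, n), b, column_walk(j, n, n))


# Two 2 x 2 matrices stored flat in row-major order.
A = [1, 2, 3, 4]
B = [5, 6, 7, 8]
C = [[matmul_element(A, B, i, j, 2) for j in range(2)] for i in range(2)]
print(C)  # [[19, 22], [43, 50]]
```

In this sketch the three concerns are already separable in software: swapping `column_walk` for another address sequence changes the access pattern without touching `mac` or the loop.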
The perception processor has programmable hardware resources that accelerate each of the three patterns found in loops. The loop unit accelerates control patterns while the address generators cover access patterns. The interconnect and the function units together implement compute patterns. The execution cluster operates in a VLIW manner under the control of horizontal microcode stored in the microcode SRAM. The microcode provides the opportunity to clock gate each resource individually on a cycle by cycle basis leading to low energy consumption. Together, these features provide the mix of high performance and hardware minimality that is crucial to perception applications.
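As a rough illustration of cycle-by-cycle clock gating, the sketch below models each microcode word as the set of resources it enables in that cycle and counts how many resource-cycles are actually clocked. The resource names, the set-based encoding and the three-cycle schedule are assumptions made for illustration only; the real instruction format is the subject of Section 9.2, and a count of gated resource-cycles is at best a crude proxy for energy.

```python
# Hypothetical resource names; a real cluster is defined by its configuration.
RESOURCES = ("loop_unit", "agen0", "agen1", "mul", "add")


def clocked_resource_cycles(schedule):
    """Total number of (resource, cycle) pairs that receive a clock."""
    return sum(len(word) for word in schedule)


def gated_fraction(schedule, resources=RESOURCES):
    """Fraction of resource-cycles that are gated off, relative to
    clocking every resource on every cycle."""
    ungated = len(schedule) * len(resources)
    return 1 - clocked_resource_cycles(schedule) / ungated


# Each entry is one microcode word: the resources enabled in that cycle.
schedule = [
    {"loop_unit", "agen0", "agen1"},         # issue addresses
    {"loop_unit", "agen0", "agen1", "mul"},  # next addresses, first multiply
    {"mul", "add"},                          # drain: multiply and accumulate
]

print(clocked_resource_cycles(schedule))  # 9 of a possible 15
```

Here 9 of the 15 possible resource-cycles are clocked, so `gated_fraction(schedule)` evaluates to roughly 0.4 for this made-up schedule.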
- 9.1 Pipeline Structure
- 9.2 Instruction Format
- 9.3 Function Units
- 9.4 Compiler Controlled Dataflow
- 9.5 Interconnect
- 9.6 Memory System Architecture
  - 9.6.1 Loop Unit
  - 9.6.2 Stream Address Generators
  - 9.6.3 Array Variable Renaming
  - 9.6.4 Addressing Modes
- 9.7 Compiler Controlled Clock Gating
- 9.8 Design Flow
- 9.9 Programming Example