2.5 Balancing Performance and Power Consumption
Given the rising interest in mobile devices and the widespread use of embedded processors in control and monitoring applications, a large body of existing work has been devoted to achieving high computational performance while also improving power efficiency. The approach taken in this dissertation is to control a clock-gated VLIW processor, consisting of a cluster of execution units and a special-purpose scratch-pad memory system, at a very fine granularity using horizontal microcode. All communication within the cluster is scheduled under software control, a technique that will be referred to as compiler-controlled dataflow. In addition, the clock signal to each function unit is controlled by the software on a cycle-by-cycle basis. This is called compiler-controlled clock gating. The details appear in Chapter 9, but this synopsis is useful in considering the relevance of preexisting approaches.
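The two techniques can be illustrated with a small behavioral sketch. The field and unit names below are hypothetical and do not reflect the actual encoding in Chapter 9; the sketch only shows how one horizontal microcode word could carry both a per-function-unit clock enable and explicit operand routing.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical sketch of one horizontal microcode word; field names are
# illustrative, not the encoding used in Chapter 9.
@dataclass
class MicroWord:
    # Per-function-unit clock enables: the compiler sets a bit only on
    # cycles where a unit does useful work (compiler-controlled clock gating).
    clock_enable: Dict[str, bool] = field(default_factory=dict)
    # Explicit operand routing: which unit's output-stage pipeline register
    # feeds each input port (compiler-controlled dataflow).
    route: Dict[str, str] = field(default_factory=dict)

# One cycle of a hypothetical schedule: the multiplier consumes values
# forwarded from the adder and load unit; all other clocks are gated off.
word = MicroWord(
    clock_enable={"adder": False, "multiplier": True, "load_unit": False},
    route={"multiplier.in0": "adder.out", "multiplier.in1": "load_unit.out"},
)

# Units whose clock is gated burn no dynamic switching energy this cycle.
active = [u for u, en in word.clock_enable.items() if en]
print(active)  # → ['multiplier']
```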
There are many vendors of high-performance, power-efficient embedded processors, such as the Philips Trimedia, TI C62xx, and Lucent DSP16000, that can be effectively scheduled to achieve reasonably low power operation [47,100,3]. Increasing performance via VLIW instruction scheduling and instruction width reduction techniques is a common theme in modern embedded systems [63,108,16,8]. Several efforts have demonstrated the benefit of VLIW architectures for customization and power management. Optimization techniques for clustered VLIW architectures can also be found in the literature. However, these efforts do not address low-level communication issues. Caliber uses an interesting software pipelining strategy that is targeted at reducing memory pressure in VLIW systems; the primary mechanism is to distribute the register file [8,7]. In contrast, in this dissertation, the output-stage pipeline registers of function units and the associated forwarding paths will be managed as if they constituted a small distributed register file. Tiwari et al. have explored scheduling algorithms for less flexible architectures, which split an application between a general-purpose processor and an ASIC. Lee investigated the power benefits of instruction scheduling for DSP processors. Eckstein and Krall focus on minimizing the cost of local variable access to reduce power consumption in DSP processors. Application-specific VLIW clusters have been investigated by many researchers [60,35]. Customizing a VLIW processor to minimize power and maximize performance by including only the necessary function units and by specializing function units via operator fusion has been studied and utilized by the Tensilica Corporation in its Xtensa architecture. The horizontal microcode approach taken in this dissertation can be viewed as a fine-grained extension of the VLIW concept.
However, the addition of sophisticated address generators, multiple address contexts per address generator, the removal of the register file, and the fine-grained steering of data are aspects presented in Chapter 9 that are not evident in these other efforts.
The MOVE family of architectures explored the concept of transport triggering, where computation is performed by transferring values to the operand registers of a function unit and starting an operation implicitly via a move targeting a trigger register associated with that unit. As in the MOVE architecture, compiler-directed data transfer between function units is used in this dissertation as well, but the resulting architecture is a traditional operation-triggered one; transport triggering is not used.
The RAW machine has demonstrated the advantages of low-level scheduling of data movement and processing in function units spread over a two-dimensional space [104,62]. The RAW work is similar to the research presented in this dissertation in many ways: low-level architectural resources are fully exposed to the compiler, and custom dataflows are scheduled by the compiler on resources that are inherently somewhat general purpose. The primary differences arise from the basic design target. The RAW effort is directed at demonstrating that high levels of performance can be achieved on an architecture consisting of many fine-grained tiles. This dissertation is directed at demonstrating that somewhat general-purpose structures can be scheduled to achieve power efficiency that competes with special-purpose ASIC designs.
The Imagine architecture is organized to exploit high levels of internal bandwidth in order to achieve high performance on stream-based data. Scheduling issues are similar, but the target is performance rather than low power. Given the poor wire-scaling properties of deep submicron CMOS processes, it is somewhat inevitable that function unit clusters will need to be considered in order to manage communication delays in high-performance, wide-issue superscalar processors. Current DSP processors like the TMS320C6000 already have clustered datapaths and register files. These approaches, however, are all focused on providing increased performance. The approach taken in this dissertation is to improve both power and performance while retaining a large degree of programmability.
One popular approach to specialization is the use of reconfigurable logic to provide customization. Techniques vary from power-aware mapping of designs onto commercially available FPGA devices to hybrid methods where specialized function blocks are embedded into a reconfigurable logic array [20,18,69,31]. Of particular relevance are compiler-directed approaches that are similar to the compiler-controlled dataflow approach used in this research. However, this dissertation targets custom silicon implementations rather than the higher-level FPGA domain. FPGA-based approaches have a significant advantage when the phases of an application persist long enough to amortize the relatively long reconfiguration times. The generality of the FPGA approach also leads to excessive energy loss. The approach taken here is commensurate with more rapid reconfiguration and exhibits significantly better energy efficiency.
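The amortization argument can be made concrete with a simple break-even calculation. All numbers below are hypothetical placeholders, not measurements from this dissertation or from any cited FPGA system; the sketch only shows why long reconfiguration times demand long-lived application phases.

```python
# Break-even sketch: reconfiguration pays off only if the specialized
# configuration runs long enough to recover the energy spent reconfiguring.
# All constants are hypothetical illustrations.

def breakeven_cycles(reconfig_energy_nj: float,
                     energy_saved_per_cycle_nj: float) -> float:
    """Cycles a phase must persist before reconfiguration pays for itself."""
    return reconfig_energy_nj / energy_saved_per_cycle_nj

# A bitstream-programmed FPGA with a costly reconfiguration, versus a
# structure that can be re-targeted in a few cycles via microcode.
fpga = breakeven_cycles(reconfig_energy_nj=50_000.0, energy_saved_per_cycle_nj=0.5)
microcode = breakeven_cycles(reconfig_energy_nj=10.0, energy_saved_per_cycle_nj=0.5)

print(fpga, microcode)  # → 100000.0 20.0
```

Under these illustrative numbers, the FPGA needs a phase thousands of times longer than the microcoded structure before specialization becomes an energy win, which is why rapid reconfiguration matters for short-lived phases.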
A number of researchers have tried to predict the energy consumption of an application running on a particular processor. Wattch is a well-known example of high-level, simulation-based power estimation. Such high-level approaches have a number of benefits: they are useful early in the design flow, and the simulations are several orders of magnitude faster than low-level estimation using tools like Spice. The disadvantage is that Wattch-like systems need to be calibrated with high-level power models that take into account all the implementation-specific details. When the actual implementation differs from the power model provided to the tool, the power estimate will be meaningless. Since the perception processor architecture described later in this dissertation is significantly different from the general-purpose architectures modeled by Wattch, the power estimates reported in this work will be based on low-level Spice simulation of actual circuits.
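The essence of such high-level estimation is an activity-based model: total energy is the sum, over architectural units, of access counts multiplied by a calibrated per-access energy. The sketch below uses entirely hypothetical unit names, energies, and counts; its point is that the estimate is only as trustworthy as those calibration constants.

```python
# Minimal sketch of Wattch-style activity-based power estimation.
# Per-access energies are hypothetical calibration constants; if they do
# not match the real implementation, the total is meaningless.

per_access_energy_nj = {   # hypothetical high-level power model
    "register_file": 0.25,
    "alu":           0.125,
    "icache":        0.5,
}

access_counts = {          # hypothetical activity counts from simulation
    "register_file": 1_000_000,
    "alu":             800_000,
    "icache":        1_000_000,
}

total_nj = sum(per_access_energy_nj[u] * access_counts[u] for u in access_counts)
print(total_nj)  # → 850000.0
```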
Clock power is often the largest energy culprit in a complex design such as a modern microprocessor [41,96], primarily because the clock signal potentially goes everywhere on the chip. Clock gating is a popular technique that selectively turns off the clock to portions of the chip that are not in use at a particular time. Krashinsky studied the benefits of clock gating applied with varying degrees of aggressiveness to a microprocessor design. Tseng and Asanovic describe a technique that conserves register file power when the value will be supplied from a bypass path. This is similar in spirit to the compiler-controlled dataflow used in this dissertation, except that the architecture described in Chapter 9 eliminates the register file altogether and uses the bypass paths to forward all values. There are two disadvantages to clock gating: the enable signal must arrive sufficiently ahead of the clock signal, and the additional gates in the clock path increase clock skew. Both effects reduce the maximum achievable clock frequency. For low-power design objectives, this is seldom a serious issue.
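The enable-timing constraint can be seen in the standard latch-based clock gate, sketched behaviorally below (a toy model, not a circuit description of any design in this dissertation): the enable is sampled through a latch that is transparent only while the clock is low, so an enable that arrives late, during the high phase, is simply ignored until the next cycle rather than producing a glitch on the gated clock.

```python
# Behavioral toy model of a latch-based clock gate: gated_clk = clk AND
# (enable latched while clk is low). A late enable cannot glitch the output.

class ClockGate:
    def __init__(self) -> None:
        self.latched_enable = False

    def tick(self, clk: int, enable: int) -> int:
        if not clk:                        # latch is transparent while clk is low
            self.latched_enable = bool(enable)
        return int(bool(clk) and self.latched_enable)  # AND gate output

gate = ClockGate()
# Cycle 2's enable is dropped before its high phase, and the late re-enable
# in cycle 3's high phase is ignored until the clock next goes low.
out = [gate.tick(clk, en) for clk, en in [(0, 1), (1, 1), (0, 0), (1, 1)]]
print(out)  # → [0, 1, 0, 0]
```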
Modulo scheduling is a well-known software pipelining approach for VLIW processors. It permits multiple loop bodies to be simultaneously in flight within a clustered VLIW processor. The perception processor discussed in this dissertation relies heavily on modulo scheduling to achieve high performance. The regular nature of modulo scheduled loops makes them amenable to algorithmic-level power analysis and optimization. While the compiler-controlled clock gating explored in this dissertation has been free of problems, such fine-grain management of power could otherwise lead to excessive power line noise, known as the di/dt effect. In such cases it is possible for a compiler to introduce additional dummy operations into a modulo scheduled loop to reduce power line disturbance. Yun and Kim present a power-aware modulo scheduling algorithm that could limit power fluctuations.
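A key quantity in modulo scheduling is the initiation interval (II), the number of cycles between the starts of successive loop iterations. One standard lower bound, the resource-constrained minimum II (ResMII), is sketched below for a hypothetical loop body; the operation and resource counts are illustrative only.

```python
import math

# Resource-constrained lower bound on the initiation interval (ResMII):
# for each resource class, the operations needing it must be spread over
# the available units, so II >= ceil(ops / units) for every class.

def res_mii(op_counts: dict, unit_counts: dict) -> int:
    """Smallest II permitted by resource usage alone."""
    return max(math.ceil(op_counts[r] / unit_counts[r]) for r in op_counts)

# Hypothetical loop body: 4 memory ops on 2 load/store units,
# 6 ALU ops on 3 ALUs.
ii = res_mii({"mem": 4, "alu": 6}, {"mem": 2, "alu": 3})
print(ii)  # → 2
```

With II = 2, a new loop iteration launches every two cycles, so at steady state several iterations overlap in the pipeline, which is exactly the "multiple loop bodies in flight" property described above. (The actual II must also satisfy a recurrence-constrained bound, RecMII, not shown here.)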
While using custom coprocessors to accelerate applications is a well-established idea, researchers have recently started emphasizing it as a means of reducing power consumption. PipeRench is one such programmable datapath developed at CMU. PipeRench uses self-controlled runtime reconfiguration and virtualization of hardware to execute a 40-tap, 16-bit FIR filter processing 41.8 million samples per second and the IDEA encryption algorithm at 450 Mbps while operating at 120 MHz. Power consumption for 15-20 filter taps while operating at 33.3 MHz is in the 600-700 mW range. Pleiades is a reconfigurable DSP architecture developed at UC Berkeley. It is a domain-specific processor that trades off the flexibility of a general-purpose processor for higher energy efficiency. The Pleiades designers report that their architecture consumes only 14 nanojoules per stage of an FFT computation, while the Intel StrongARM and the Texas Instruments TMS320C2xx consume 36 nJ and 15 nJ respectively after normalizing for CMOS process parameters. The opportunities for special-purpose architectures to improve on the power consumption and performance of general-purpose devices are numerous. Direct comparison against such systems is often impossible because of their unavailability and the difference in the domains they target. For this reason, the approach described in this dissertation will be compared against commercially available general-purpose processors and ASIC implementations of algorithms, not against domain-specific accelerators in the literature.