Equations 3.1 and 3.3
point to several power reduction strategies. For instance, power consumption
can be reduced by increasing IPC. However, modern dynamically scheduled
processors also increase the switched capacitance when they increase IPC,
because they introduce large reorder buffers, complex cache structures,
register renaming and support for speculative execution. Architectures
that can provide high IPC without an inordinate rise in switched
capacitance will lead to low power consumption. This can be achieved at
the cost of generality by using simple, application-domain-specific,
ILP-enhancing mechanisms as well as by taking advantage of
compiler-driven static ILP improvements. Increasing the issue width causes
some increase in power consumption because of the wider structures
used to support multiple issue. However, since most of the ILP extraction
is done at compile time, and because the additional logic can be tailored
to take advantage of domain-specific optimizations, the strategy still
yields net power savings.
Another architectural means of reducing power consumption is to decrease
the activity factor. Clock gating provides one method of reducing
the activity factor [96]. Designing structures so that activity
in one part is not visible in other parts is
another useful technique. A typical example is the forwarding paths
of a superscalar microprocessor. A forwarding mux connected to the
output of a function unit makes the value changes occurring in the
final stage of that unit visible at the inputs of other function units
even when the receiving units do not need the forwarded value. This
leads to unnecessary switching activity and power dissipation at the
receiving side. When the forwarding path is not needed, the mux select
signals can be manipulated so that unnecessary value changes are not
visible at the receiving side. This strategy, called operand
isolation, was used in the IBM PowerPC 4xx embedded controllers
[27]. Operand isolation under compiler control is used
as a power-saving strategy for the perception processor described
in Chapter 9.
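
The effect of operand isolation on switching activity can be illustrated
with a small behavioral sketch. It is not a model of the PowerPC 4xx or of
the perception processor; the function names, the 8-bit values and the
per-cycle `needed` flags are hypothetical and chosen only to show how
holding the receiver input steady removes unnecessary toggles.

```python
def toggles(prev: int, new: int) -> int:
    """Number of bit positions that change between two bus values."""
    return bin(prev ^ new).count("1")


def receiver_input_toggles(producer_values, needed, isolate):
    """Count bit toggles seen at a receiving unit's input.

    producer_values: value at the producing unit's output on each cycle
    needed: per-cycle flags, True when the receiver uses the forwarded value
    isolate: when True, the mux select keeps the previous value visible on
             cycles where the forwarded value is not needed
    """
    seen = 0   # value currently visible at the receiver input
    total = 0
    for value, use in zip(producer_values, needed):
        if use or not isolate:
            total += toggles(seen, value)
            seen = value
        # otherwise the input is held steady and nothing switches
    return total


# Hypothetical trace: the receiver needs only the first and last values.
values = [0xFF, 0x00, 0xFF, 0x0F]
needed = [True, False, False, True]

without_isolation = receiver_input_toggles(values, needed, isolate=False)
with_isolation = receiver_input_toggles(values, needed, isolate=True)
print(without_isolation, with_isolation)  # prints "28 12"
```

In this trace the two unused values account for 16 of the 28 toggles
seen without isolation, which is the activity the mux select removes.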
Lowering the ideal operating frequency also permits the use of a lower
supply voltage, which results in power savings. If frequency is directly
proportional to supply voltage, Equation 3.1
predicts cubic power reduction. However, in reality,
where
is a device saturation constant whose value ranges
from zero to two when velocity saturation is not explicitly modeled
[12]. Considering this relationship, quadratic or linear
power savings may be obtained by lowering the supply voltage and operating
frequency. This strategy capitalizes on the results produced by researchers
exploring ideal voltage selection and voltage scaling [76].
Equation 3.1 applies only within a narrow, process-specific
supply voltage range.
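
The cubic case can be checked with a few lines of arithmetic. The sketch
below assumes that Equation 3.1 has the usual CMOS dynamic power form, in
which power is proportional to the square of the supply voltage times the
frequency when activity factor and capacitance are held fixed. That form is
an assumption here, and the function name and scale factors are
illustrative only.

```python
def relative_dynamic_power(v_scale: float, f_scale: float) -> float:
    """Dynamic power relative to a baseline design.

    Assumes P is proportional to V^2 * f with the activity factor and
    capacitance unchanged (an assumed form of Equation 3.1).
    """
    return v_scale ** 2 * f_scale


# Frequency directly proportional to voltage: halving both gives 1/8 power.
cubic = relative_dynamic_power(v_scale=0.5, f_scale=0.5)

# Frequency halved with the supply voltage left alone: only linear savings.
linear = relative_dynamic_power(v_scale=1.0, f_scale=0.5)

print(cubic, linear)  # prints "0.125 0.5"
```

The gap between the two results shows why the achievable voltage
reduction, rather than the frequency reduction alone, determines most of
the savings.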
Ultimately, the average IPC available in an application is limited
by the dependences between instructions. Further improvements may
be obtained by multithreading the application, in which case IPC
in Equation 3.3 corresponds to the aggregate
IPC of the individual threads. Traditional high performance multiprocessors
exact a high energy price because of the complexities of memory system
coherence and interthread communication. By tailoring a multiprocessor
system to the information flow and synchronization patterns found
in perception applications, it is possible to design simple architectures
that provide sufficient generality for the perception domain.
Perception applications are usually stream-oriented. They consist of a pipeline of algorithms, most of which are compute and memory intensive. Each phase typically touches and discards a large data set in a block-oriented manner, i.e., several input blocks and a few blocks of local state are consulted to compute a block of output. There is little or no reuse of the high-bandwidth input data, which consists of both input signals and massive knowledge bases that are too large to cache on-chip. One or more phases may be executed on a processor, and multiple processors may be connected in a pipeline for efficient interphase communication while harvesting thread-level parallelism.
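
The block-oriented, touch-and-discard pattern described above can be
sketched in software as a chain of generators. This is only an
illustration of the data flow, not of the hardware. The `phase` helper,
the two toy phase functions and the block sizes are hypothetical.

```python
from collections import deque


def phase(blocks, window, combine):
    """One pipeline phase.

    Keeps only the last `window` input blocks as local state, computes one
    output block from them, and lets older input blocks be discarded.
    """
    recent = deque(maxlen=window)
    for block in blocks:
        recent.append(block)
        if len(recent) == window:
            yield combine(list(recent))


def smooth(window_blocks):
    """Element-wise average of the blocks in the window."""
    n = len(window_blocks)
    return [sum(column) / n for column in zip(*window_blocks)]


def energy(window_blocks):
    """Sum of squares of the single block in the window."""
    return [sum(x * x for x in window_blocks[0])]


# Input stream of five two-element blocks, produced lazily and never reused.
signal = ([float(i), float(i + 1)] for i in range(5))

# Two phases connected back to back, as processors in a pipeline would be.
features = phase(phase(signal, 2, smooth), 1, energy)
result = list(features)
print(result)  # prints "[[2.5], [8.5], [18.5], [32.5]]"
```

Each phase holds at most `window` blocks at a time, mirroring the
observation that input data is consulted once and then dropped rather than
cached.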