Though CMOS circuits often have the ability to trade energy for performance,
it is quite difficult to improve both energy and performance simultaneously.
Gonzalez and Horowitz argue that the process-normalized energy-delay
product (EDP), or alternately its inverse, a
performance-squared-per-watt figure of merit,
is a relatively implementation-neutral metric
[39]. They demonstrate that this metric causes
the architectural improvements that contribute the most to both performance
and energy efficiency to stand out. For example, their results demonstrate
that pipelining is of fundamental importance to processor performance
and energy efficiency, whereas superscalar issue makes a lesser contribution.
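The relationship between EDP and its inverse can be checked with a short sketch. The function names and the two sample designs below are hypothetical and chosen only for illustration; they are not measurements from this work. The sketch assumes that energy and delay refer to the same fixed unit of work, so that throughput is the reciprocal of delay and average power is energy divided by delay.

```python
import math


def edp(energy_joules, delay_seconds):
    """Energy-delay product for one unit of work (lower is better)."""
    return energy_joules * delay_seconds


def perf_squared_per_watt(energy_joules, delay_seconds):
    """Throughput squared divided by average power (higher is better)."""
    throughput = 1.0 / delay_seconds             # units of work per second
    power_watts = energy_joules / delay_seconds  # average power over the run
    return throughput ** 2 / power_watts


# Hypothetical (energy in joules, delay in seconds) pairs, for illustration only.
designs = {
    "baseline": (2.0, 1.0),
    "pipelined": (1.5, 0.5),
}

for name, (energy, delay) in designs.items():
    product = edp(energy, delay)
    inverse_metric = perf_squared_per_watt(energy, delay)
    # The two metrics are reciprocals: (1/t)^2 / (E/t) = 1 / (E * t).
    assert math.isclose(inverse_metric, 1.0 / product)
    print(name, product, inverse_metric)
```

Because the two quantities are reciprocals, ranking designs by lowest EDP and by highest performance squared per watt gives the same ordering.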
Figure 10.5 shows the process-normalized energy-delay
product (EDP) of the four different designs. It may be seen
that in spite of their radically different architectures, the XScale's
EDP is within 31.4% of the EDP of the Pentium if we ignore the outliers
FFT and Fleshtone. The FFT result is different because the XScale
uses a simple radix-2 algorithm instead of the optimized FFTW library
used on the Pentium. The Fleshtone result reflects the fact that
for this floating point benchmark, the XScale is modeled as an ideal
implementation. The floating point version of this algorithm has a
performance problem on the Pentium as explained in Section 10.4.3.
It is evident from Figure 10.5 that the perception processor has a radically better EDP, often one or two orders of magnitude better than that of its competition. It is particularly noteworthy that in the case of FFT, where the perception processor achieved only 64% of the throughput of the Pentium, it still improves EDP by a factor of 24.5. This may be largely attributed to the higher energy efficiency of the perception processor. On average, the perception processor improves on the EDP of the XScale by a factor of 159 and is only 12 times worse than the ASIC. The perception processor is thus able to bridge the wide gap in EDP between CPUs and ASICs.
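The FFT figures quoted above imply an energy ratio that can be worked out directly. The sketch below is a back-of-the-envelope calculation, not a result reported in this chapter. It assumes that EDP is energy times delay for the same work on both machines, that delay is the reciprocal of throughput, and that the same process normalization applies to both EDP values. The function name is hypothetical.

```python
def implied_energy_ratio(edp_improvement, relative_throughput):
    """Energy advantage implied by an EDP improvement and a throughput ratio.

    With EDP = E * t and t_new = t_old / relative_throughput:
        EDP_old / EDP_new = (E_old / E_new) * relative_throughput
    so  E_old / E_new     = edp_improvement / relative_throughput.
    """
    return edp_improvement / relative_throughput


# FFT case from the text: 64% of the Pentium's throughput, EDP better by 24.5x.
ratio = implied_energy_ratio(edp_improvement=24.5, relative_throughput=0.64)
print(round(ratio, 1))  # 38.3
```

Under these assumptions, the perception processor would use roughly 38 times less energy than the Pentium for the FFT workload, which is why its EDP improves even though its throughput is lower.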