Most current research in processor architecture revolves around optimizing four criteria: energy (or power), delay, area, and reliability. For scientific applications, memory bandwidth is also a precious commodity that critically affects delay. One or more of these criteria can often be traded off against another. For processors, the product of the energy and the delay required to process a given workload is often relatively constant across different architectures after normalizing for the CMOS process [2]. To achieve a significant improvement in the energy-delay product, the architecture therefore needs to be optimized to exploit the characteristics of the target application. Stream applications exhibit three forms of parallelism that can be exploited: instruction-level parallelism (executing independent instructions in parallel), data parallelism (operating on multiple data elements at once, often using SIMD), and task parallelism (executing different tasks in parallel on different processors).
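For reference, the energy-delay product mentioned above can be written out as follows. The symbols E (energy consumed to process the workload), D (time taken to process it), and P (average power) are notation assumed here for illustration and do not appear in the original text.

```latex
\mathrm{EDP} = E \cdot D, \qquad E = P \cdot D \;\Rightarrow\; \mathrm{EDP} = P \cdot D^{2}
```

Under this definition, halving delay at constant average power reduces the product by a factor of four, whereas trading energy for delay in equal proportion leaves it unchanged.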
Current high-performance dynamically scheduled out-of-order processors
are optimized for applications that have a limited amount of
instruction-level parallelism (ILP). They do not depend heavily on
task-level parallelism, as evidenced by the fact that most processors
support only one- or two-way SMT and one or two cores per chip. Deep sub-micron
CMOS processes offer the opportunity to fabricate several thousand
32-bit adders or multipliers on a microprocessor die. Yet, because of
the limited ILP and irregular nature of typical applications, most
microprocessors have six or fewer function units.
The bulk of the area and power is consumed by caches, branch predictors,
instruction windows, and other structures associated with speculation
and with identifying and exploiting ILP. Stream applications, on the
other hand, have very regular structures, are loop oriented, and exhibit
very high levels of ILP. In fact, Kapasi et al. report achieving
28 to 53 instructions per cycle for a set of stream applications on
the Imagine stream processor [9]. To achieve such
high levels of parallelism, stream processors typically utilize a
large number of function units that are fed by a hierarchy of local
register files, stream register files, and stream caches that decouple
memory access from program execution. The resulting bandwidth hierarchy
exploits data parallelism and main memory bandwidth much more efficiently
than traditional processors can, yielding better performance per unit
area or unit power. In addition, it is possible to construct multiple
stream processors on the same chip and use stream communication mechanisms
to ensure high-bandwidth data flow and exploit task-level parallelism.
The trade-off, compared with general-purpose processors, is a much
more restrictive programming model and applicability that is limited
to the streaming domain.
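
To make the restrictive programming model more concrete, the following is a minimal sketch of the general stream style, written in ordinary Python rather than in the language of any actual stream processor. The names `kernel_map`, `scale`, `offset`, and `run_pipeline` are hypothetical and chosen only for illustration. Each kernel touches one record at a time with no dependence between records (the property that hardware could exploit as data parallelism), and kernels communicate only by passing streams to one another (the property that could be exploited as task-level parallelism across processors).

```python
from typing import Callable, Iterable, Iterator, List


def kernel_map(kernel: Callable[[float], float],
               stream: Iterable[float]) -> Iterator[float]:
    """Apply a kernel independently to every record of an input stream.

    Because no record depends on another, the iterations could in
    principle be executed in parallel (data parallelism).
    """
    for record in stream:
        yield kernel(record)


def scale(x: float) -> float:
    """Hypothetical first-stage kernel: multiply each record by 2."""
    return 2.0 * x


def offset(x: float) -> float:
    """Hypothetical second-stage kernel: add 1 to each record."""
    return x + 1.0


def run_pipeline(data: Iterable[float]) -> List[float]:
    """Chain two kernels in producer-consumer fashion.

    The stages exchange data only through streams, so each stage could
    be placed on a separate processor (task-level parallelism).
    """
    stage1 = kernel_map(scale, data)     # stream produced by the first kernel
    stage2 = kernel_map(offset, stage1)  # consumed directly by the second
    return list(stage2)


if __name__ == "__main__":
    print(run_pipeline([1.0, 2.0, 3.0]))
```

The sketch runs sequentially; it only illustrates the structure that makes such programs amenable to parallel execution, and the restriction it implies: computation that cannot be expressed as kernels over streams does not fit the model.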