Most current research in processor architecture revolves around optimizing four criteria: energy (or power), delay, area, and reliability. For scientific applications, memory bandwidth is also a precious commodity that critically affects delay. One or more of these criteria can often be traded off for the sake of another. For processors, the product of energy and delay required to process a given workload is often relatively constant across different architectures after normalizing for the CMOS process. To achieve a significant improvement in the energy-delay product, the architecture needs to be optimized to exploit the characteristics of the target application. Stream applications exhibit three forms of parallelism that can be exploited: instruction-level parallelism (executing independent instructions in parallel), data parallelism (operating on multiple data elements at once, often using SIMD), and task parallelism (executing different tasks in parallel on different processors).
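The energy-delay argument above can be written out as a short worked relation. The symbols here are notation introduced only for illustration and do not come from the text: E and D are the energy and delay for a fixed workload, A and B are two hypothetical design points, and k > 1 is a hypothetical trade-off factor.

```latex
\begin{gather*}
\mathrm{EDP} = E \cdot D \\
E_B = \frac{E_A}{k}, \qquad D_B = k \, D_A \\
\mathrm{EDP}_B = E_B \cdot D_B = \frac{E_A}{k} \cdot k \, D_A = E_A \cdot D_A = \mathrm{EDP}_A
\end{gather*}
```

Under this assumed trade-off, design B saves energy only by running proportionally slower, so its energy-delay product is unchanged. A lower product requires reducing one factor without a matching increase in the other, which is the role the text assigns to application-specific optimization.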
Current high-performance, dynamically scheduled out-of-order processors are optimized for applications that have a limited amount of instruction-level parallelism (ILP). They do not depend heavily on task-level parallelism, as evidenced by the fact that most processors support only one- or two-way SMT and one or two cores per chip. Deep sub-micron CMOS processes offer the opportunity to fabricate several thousand 32-bit adders or multipliers on a microprocessor die. Yet, because of the limited ILP and irregular nature of typical applications, most microprocessors have six or fewer function units. The bulk of the area and power is consumed by caches, branch predictors, instruction windows, and other structures associated with identifying and exploiting ILP and speculation. Stream applications, on the other hand, have very regular structures, are loop oriented, and exhibit very high levels of ILP. In fact, Kapasi et al. report achieving 28 to 53 instructions per cycle for a set of stream applications on the Imagine stream processor. To achieve such high levels of parallelism, stream processors typically use a large number of function units fed by a hierarchy of local register files, stream register files, and stream caches that decouple memory access from program execution. The resulting bandwidth hierarchy can exploit data parallelism and main memory bandwidth much more efficiently than traditional processors, resulting in better performance per unit area or unit power. In addition, it is possible to construct multiple stream processors on the same chip and use stream communication mechanisms to ensure high-bandwidth data flow and exploit task-level parallelism. The trade-off compared to general-purpose processors is a much more restrictive programming model and applicability that is limited to the streaming domain.
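The stream programming model described above can be sketched in a few lines of Python. This is a minimal illustration of the structure only, not the Imagine programming interface: the names `load_stream`, `kernel`, `run_pipeline`, and `strip_size` are hypothetical, and the code runs sequentially, so it shows how memory access is separated from kernel execution rather than any performance benefit.

```python
from typing import Callable, Iterable, Iterator, List

Strip = List[float]  # a fixed-size chunk of a stream (hypothetical name)


def load_stream(memory: List[float], strip_size: int) -> Iterator[Strip]:
    """Yield strips from 'memory', standing in for bulk stream loads."""
    for start in range(0, len(memory), strip_size):
        yield memory[start:start + strip_size]


def kernel(fn: Callable[[float], float]) -> Callable[[Iterable[Strip]], Iterator[Strip]]:
    """Wrap an element-wise function as a kernel that maps strips to strips."""
    def run(strips: Iterable[Strip]) -> Iterator[Strip]:
        for strip in strips:
            # Elements are independent of one another, which is the data
            # parallelism a stream processor would spread across function units.
            yield [fn(x) for x in strip]
    return run


def run_pipeline(memory: List[float], strip_size: int, kernels) -> List[float]:
    """Chain kernels so each consumes the stream produced by the previous one."""
    stream = load_stream(memory, strip_size)
    for k in kernels:
        stream = k(stream)
    out: List[float] = []
    for strip in stream:
        out.extend(strip)
    return out


# Two hypothetical kernels: scale each element, then add an offset.
scale = kernel(lambda x: 2.0 * x)
offset = kernel(lambda x: x + 1.0)

result = run_pipeline([0.0, 1.0, 2.0, 3.0, 4.0], strip_size=2, kernels=[scale, offset])
print(result)  # [1.0, 3.0, 5.0, 7.0, 9.0]
```

Kernels here never index into `memory` directly; they see only the strips handed to them. That restriction is a small-scale version of the trade-off noted above: the model is regular enough to parallelize, but it only fits computations that can be expressed as kernels over streams.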