The Imagine image and signal processor, developed by Prof. William Dally and the Concurrent VLSI Architecture Group (CVA) at Stanford University, was the pioneering project in stream processing. Figure 4 shows the internal structure of an Imagine processor. The Imagine processor consists of eight execution clusters, each containing six ALUs, for a peak execution rate of 48 arithmetic operations per cycle. Each cluster executes a VLIW instruction under the control of the micro-controller. The same VLIW instruction is issued to all clusters, so the instruction fetch and control overhead is amortized over 8-way SIMD execution. The bandwidth hierarchy consists of local register files (LRFs) attached to each function unit that provide 435 GB/s, a stream register file (SRF) that feeds the LRFs at 25.6 GB/s, and a streaming memory system that feeds the SRF at 2.1 GB/s. The LRFs within each cluster are connected directly to their function units but can accept results from other function units over an intra-cluster switching network. Each cluster also contains a communication unit that can send results to and receive results from other clusters over an inter-cluster switching network. The SRF is internally split into banks, one serving each cluster. In addition, each cluster has a 256-word, 32-bit scratchpad memory. The host processor queues kernels for execution with the stream controller via the host interface. The stream controller initiates stream loads and stores via the stream memory system, uses an internal scoreboard to ensure that dependences are satisfied, and then lets the micro-controller sequence the execution of the next kernel whose dependences have all been satisfied. The Imagine processor was fabricated in a CMOS process and achieved 7.96 GFLOPS and 25.4 GOPS at 200 MHz. Imagine was succeeded by the Merrimac project, which focused on developing a streaming supercomputer for scientific computing.
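The peak rate and the bandwidth-hierarchy ratios quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only figures from the text (variable names are ours); note that the reported 7.96 GFLOPS is an achieved rate that depends on instruction mix, not this raw 32-bit peak:

```python
# Back-of-the-envelope check of Imagine's quoted figures.
CLUSTERS = 8
ALUS_PER_CLUSTER = 6
CLOCK_HZ = 200e6  # 200 MHz

ops_per_cycle = CLUSTERS * ALUS_PER_CLUSTER   # 48 arithmetic ops per cycle
peak_gops = ops_per_cycle * CLOCK_HZ / 1e9    # 9.6 GOPS peak (32-bit ops)

# Bandwidth hierarchy: each level supplies roughly an order of
# magnitude more bandwidth than the level beneath it.
lrf_gbps, srf_gbps, mem_gbps = 435.0, 25.6, 2.1
lrf_to_srf = lrf_gbps / srf_gbps              # LRF : SRF ratio
srf_to_mem = srf_gbps / mem_gbps              # SRF : memory ratio

print(ops_per_cycle, peak_gops)
print(round(lrf_to_srf, 1), round(srf_to_mem, 1))
```

The roughly 17x and 12x steps between levels are what let kernels keep 48 ALUs busy while the off-chip memory system sustains only 2.1 GB/s.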
Both Imagine and Merrimac were developed primarily around the concept of time multiplexing; they are therefore optimized for applications with very high levels of data parallelism. In a multi-chip configuration, these systems can be space-time multiplexed by having the kernels on each node communicate with their counterparts on other nodes over a network interface.
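The time-multiplexed model can be sketched as follows: each kernel runs to completion over an entire stream before the next kernel starts, with intermediate streams held on-chip (the SRF's role in Imagine). This is an illustrative sketch, not Imagine's programming interface; the kernel and stream names are hypothetical:

```python
# Time multiplexing: one kernel at a time is applied across the whole
# stream, exposing data parallelism (here, over stream elements) that
# the hardware exploits with SIMD execution across clusters.
def run_pipeline(kernels, input_stream):
    stream = input_stream
    for kernel in kernels:                    # kernels run one after another
        stream = [kernel(x) for x in stream]  # whole stream per kernel
    return stream

# Example: a two-kernel pipeline over a small stream.
scale = lambda x: 2 * x
offset = lambda x: x + 1
result = run_pipeline([scale, offset], [1, 2, 3])  # [3, 5, 7]
```

Space-time multiplexing would instead place different kernels on different nodes and stream data between them over the network, rather than iterating the kernel list on one node.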