The Imagine image and signal processor, developed by Prof. William
Dally and the Concurrent VLSI Architecture Group (CVA) at Stanford
University, was the pioneering project in stream processing [9].
Figure 4 shows the internal structure
of an Imagine processor. The processor consists of eight execution
clusters, each containing six ALUs, for a peak execution rate of
48 arithmetic operations per cycle. Each cluster
executes a VLIW instruction under the control of the micro-controller.
The same VLIW instruction is issued to all clusters, so instruction
fetch and control overhead is amortized over 8-way SIMD execution.
The bandwidth hierarchy consists of local register
files (LRFs) attached to each function unit that provide 435 GB/s, a
stream register file (SRF) that feeds the LRFs at 25.6 GB/s, and a
streaming memory system that feeds the SRF at 2.1 GB/s. The LRFs within
each cluster are connected directly to function units but can accept
results from other function units over an intra-cluster switching
network. Each cluster also contains a communication unit that can
send results to and receive results from other clusters over an
inter-cluster switching network. The SRF is internally split into
banks that serve each cluster. Each cluster also has a 256-word,
32-bit scratchpad memory.
The host processor queues kernels for execution with the stream controller
via the host interface. The stream controller initiates stream loads
and stores via the stream memory system, uses an internal scoreboard
to ensure that dependences are satisfied, and then lets the micro-controller
sequence the execution of the next kernel function whose dependences
have all been satisfied. The Imagine processor was fabricated in a
CMOS process and achieved 7.96 GFLOPS and 25.4 GOPS at 200
MHz. Imagine was succeeded by the Merrimac project, which focused
on developing a streaming supercomputer for scientific computing.
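As a quick check on the figures quoted above, the following Python sketch recomputes the peak operation count per cycle and the ratios between adjacent levels of the bandwidth hierarchy. Only the numbers come from the text; the constant and function names are illustrative assumptions.

```python
# Figures quoted in the text. All names below are hypothetical and
# chosen only for this sketch.
CLUSTERS = 8            # execution clusters
ALUS_PER_CLUSTER = 6    # ALUs in each cluster
LRF_GBPS = 435.0        # local register file bandwidth, GB/s
SRF_GBPS = 25.6         # stream register file bandwidth, GB/s
MEM_GBPS = 2.1          # streaming memory system bandwidth, GB/s


def peak_ops_per_cycle(clusters=CLUSTERS, alus=ALUS_PER_CLUSTER):
    """One arithmetic operation per ALU per cycle, summed over all clusters."""
    return clusters * alus


def bandwidth_ratios(lrf=LRF_GBPS, srf=SRF_GBPS, mem=MEM_GBPS):
    """How much more bandwidth each level offers than the level below it."""
    return {
        "lrf_to_srf": lrf / srf,
        "srf_to_mem": srf / mem,
        "lrf_to_mem": lrf / mem,
    }


if __name__ == "__main__":
    print("peak ops per cycle:", peak_ops_per_cycle())
    for name, ratio in bandwidth_ratios().items():
        print(f"{name}: {ratio:.1f}x")
```

With the quoted numbers, the LRFs supply roughly 17 times the bandwidth of the SRF, and the SRF roughly 12 times that of the memory system. This steep gradient suggests why stream kernels benefit from keeping intermediate results in the LRFs and SRF rather than writing them back to memory.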
Both Imagine and Merrimac were designed primarily around the concept of time multiplexing, and are therefore optimized for applications with very high levels of data parallelism. In a multi-chip configuration, these systems can also be space-time multiplexed by having kernels on each node communicate with their counterparts on other nodes over a network interface.
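The control flow described in this section (the host queues kernels, a scoreboard tracks dependences, and one kernel at a time runs across all clusters) can be sketched in software. The Python below is a loose analogy under simplifying assumptions, not a model of the Imagine hardware or its programming tools: each kernel reads a single input stream and writes a single output stream, and the names `run_kernel_simd` and `run_time_multiplexed` are hypothetical.

```python
NUM_CLUSTERS = 8  # cluster count from the text


def run_kernel_simd(kernel, stream, num_clusters=NUM_CLUSTERS):
    """Apply one kernel to every record of a stream.

    Records are dealt round-robin to clusters, and every cluster applies
    the same function, standing in for a shared instruction sequence.
    """
    out = [None] * len(stream)
    for cluster in range(num_clusters):
        for i in range(cluster, len(stream), num_clusters):
            out[i] = kernel(stream[i])
    return out


def run_time_multiplexed(kernels, streams):
    """Run queued kernels one at a time, in dependence order.

    kernels: list of (name, fn, input_stream, output_stream) tuples in
             the order the host queued them (an assumed format).
    streams: dict mapping stream names to lists of records; entries
             present at the start play the role of completed stream loads.
    Returns the order in which kernels were issued.
    """
    ready = set(streams)      # scoreboard: streams whose data is available
    pending = list(kernels)
    issued = []
    while pending:
        # Pick the earliest queued kernel whose input stream is ready.
        idx = next((i for i, k in enumerate(pending) if k[2] in ready), None)
        if idx is None:
            raise RuntimeError("no queued kernel has its dependences satisfied")
        name, fn, src, dst = pending.pop(idx)
        streams[dst] = run_kernel_simd(fn, streams[src])
        ready.add(dst)
        issued.append(name)
    return issued


if __name__ == "__main__":
    streams = {"in": list(range(16))}
    kernels = [
        ("scale", lambda x: 2 * x, "tmp", "out"),   # waits for "tmp"
        ("offset", lambda x: x + 1, "in", "tmp"),
    ]
    print(run_time_multiplexed(kernels, streams))
    print(streams["out"])
```

In this sketch all clusters work on a single kernel at any moment and kernels follow one another in time, which corresponds to the time-multiplexed arrangement. A space-time variant would instead place different kernels on different nodes and pass streams between them.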