
4.2 RAW

The RAW processor is a wire-delay-exposed tiled architecture developed by Prof. Anant Agarwal and the Computer Architecture Group (CAG) at MIT as a part of the Oxygen ubiquitous computing project [17]. Increasing wire delays in sub-micron CMOS processes and the demand for high clock rates have created a need to decentralize control and to distribute resources as semi-autonomous clusters that avoid single-cycle global communication. The RAW processor approaches this problem by splitting the die area into a square array of identical tiles that communicate with each other over a mesh network.

Each tile contains an 8-stage, in-order, single-issue MIPS-like processor with a pipelined FPU, 32 KB of instruction cache, 32 KB of data cache, and routers for two static and two dynamic networks that transport 32-bit data. The routers have a further 64 KB of instruction cache. Point-to-point transport of scalar values is done over the high-performance static network, which is fully compiler controlled and guarantees in-order operand delivery. The dynamic network carries traffic that is difficult to schedule fully at compile time, such as I/O, main memory traffic and inter-tile message passing.

The static router controls two crossbars, each with seven inputs: the four neighboring tiles in the square array, the router pipeline itself, the other crossbar and the processor. For tiles on the periphery of the chip, some of the links connect to external interfaces. The tiles and the static router are designed for single-cycle latency between hops. The compiler encodes the routing decisions for the crossbars into a 64-bit instruction that is fetched from the router's 64 KB instruction cache and executed by the static router. Inter-tile communication latency is reduced further by integrating the network with the bypass paths of the processor.
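As a rough illustration of the tile array and the single-cycle hop latency described above, the following Python sketch models a square mesh and counts hops between two tiles. The row-major tile numbering, the function names and the use of shortest (Manhattan) paths are assumptions made for this example only; on the real chip the compiler chooses the static routes, and the sketch ignores any cycles spent entering or leaving the network.

```python
# Illustrative model of a square tile mesh (not the RAW toolchain).
MESH_DIM = 4  # 4 x 4 = 16 tiles, the size of the fabricated chip

# The seven inputs of each static-router crossbar, as listed in the text.
CROSSBAR_INPUTS = ("north", "south", "east", "west",
                   "router_pipeline", "other_crossbar", "processor")


def tile_coords(tile_id, dim=MESH_DIM):
    """Map a tile id to (row, col), assuming row-major numbering."""
    return divmod(tile_id, dim)


def hop_count(src, dst, dim=MESH_DIM):
    """Minimum number of mesh hops between two tiles (Manhattan distance)."""
    r1, c1 = tile_coords(src, dim)
    r2, c2 = tile_coords(dst, dim)
    return abs(r1 - r2) + abs(c1 - c2)


def neighbors(tile_id, dim=MESH_DIM):
    """Return the on-chip neighbors of a tile, keyed by direction.

    Directions missing from the result are the links that, on a
    peripheral tile, would connect to external interfaces instead.
    """
    row, col = tile_coords(tile_id, dim)
    candidates = {
        "north": (row - 1, col),
        "south": (row + 1, col),
        "west": (row, col - 1),
        "east": (row, col + 1),
    }
    return {
        direction: r * dim + c
        for direction, (r, c) in candidates.items()
        if 0 <= r < dim and 0 <= c < dim
    }


# With one cycle per hop, the hop count is also a lower bound on the
# static-network latency in cycles between two tiles.
corner_to_corner_cycles = hop_count(0, MESH_DIM * MESH_DIM - 1)
```

Under these assumptions, a corner tile has only two on-chip neighbors, and the longest shortest path in a 16-tile array is six hops.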
A 225 MHz implementation of a 16-tile RAW processor was fabricated in a $0.18\,\mu$m CMOS process and achieved speedups of 4.9 to 15.4 over a 600 MHz Pentium 3 for a set of stream-oriented benchmarks written in the StreamIt language.
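To put the reported speedups in perspective, the short Python sketch below normalizes them by the clock ratio and by the tile count. These derived figures do not appear in the source and are only back-of-the-envelope arithmetic; a cycle-for-cycle comparison ignores differences in microarchitecture and memory system between the two chips, and the function names are hypothetical.

```python
# Figures quoted in the text.
RAW_CLOCK_MHZ = 225.0
PENTIUM3_CLOCK_MHZ = 600.0
NUM_TILES = 16
REPORTED_SPEEDUPS = (4.9, 15.4)  # lowest and highest reported speedup


def per_clock_speedup(speedup):
    """Speedup rescaled as if both chips ran at the same clock rate."""
    return speedup * (PENTIUM3_CLOCK_MHZ / RAW_CLOCK_MHZ)


def per_tile_speedup(speedup):
    """Wall-clock speedup divided evenly across the 16 tiles."""
    return speedup / NUM_TILES


for s in REPORTED_SPEEDUPS:
    print(f"speedup {s}: per clock {per_clock_speedup(s):.1f}, "
          f"per tile {per_tile_speedup(s):.2f}")
```

Since the Pentium 3 runs at roughly 2.7 times the RAW clock rate, the wall-clock speedups of 4.9 to 15.4 correspond to roughly 13 to 41 on a per-cycle basis.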

Figure 5: The RAW Processor

Because each tile runs an independent thread of execution, the RAW processor is capable of performing time, space and space-time multiplexing. The StreamIt language mostly exposes a space-multiplexed programming model, even though the compiler is capable of partitioning kernels and load balancing them for space-time multiplexing.
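The difference between these modes can be sketched with a toy kernel-to-tile assignment in Python. This is not the StreamIt compiler's partitioning algorithm: the greedy longest-first heuristic, the function name, and the kernel names and costs are all hypothetical and chosen only to illustrate the idea.

```python
import heapq


def balance_kernels(kernel_costs, num_tiles):
    """Greedily assign kernels to tiles, most expensive kernel first.

    kernel_costs maps a kernel name to an estimated cost per iteration.
    Returns a dict mapping tile index to the list of kernels placed on it.
    """
    # Min-heap of (accumulated load, tile index); the least loaded tile
    # is always popped first.
    heap = [(0, tile) for tile in range(num_tiles)]
    heapq.heapify(heap)
    assignment = {tile: [] for tile in range(num_tiles)}

    by_cost = sorted(kernel_costs.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, cost in by_cost:
        load, tile = heapq.heappop(heap)
        assignment[tile].append(name)
        heapq.heappush(heap, (load + cost, tile))
    return assignment


# Hypothetical stream kernels with made-up relative costs.
kernels = {"fir": 8, "fft": 6, "quantize": 3, "pack": 2, "sink": 1}

space = balance_kernels(kernels, num_tiles=5)       # one kernel per tile
time_only = balance_kernels(kernels, num_tiles=1)   # all kernels share a tile
space_time = balance_kernels(kernels, num_tiles=2)  # kernels share tiles
```

With at least as many tiles as kernels, every kernel gets its own tile (space multiplexing). With a single tile, all kernels take turns on it (time multiplexing). With fewer tiles than kernels, several kernels share each tile while the tiles run concurrently (space-time multiplexing).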


Binu K. Mathew, Ali Ibrahim