The RAW processor is a wire-delay-exposed tiled architecture developed
by Prof. Anant Agarwal and the Computer Architecture Group (CAG) at
MIT as a part of the Oxygen ubiquitous computing project [17].
Increasing wire delays in sub-micron CMOS processes and the demand
for high clock rates have created a need to decentralize control and
distribute resources as semi-autonomous clusters that
avoid the need for single-cycle global communication. The RAW processor
approaches this problem by splitting the die area into a square array
of identical tiles that communicate with each other over
a mesh network. Each tile contains an 8-stage, in-order, single-issue
MIPS-like processor with a pipelined FPU, 32 KB of instruction cache,
32 KB of data cache and routers for two static and two dynamic networks
that transport 32-bit data. The static router has its own 64 KB
instruction cache. Point-to-point transport of scalar values is done over the
high-performance static network, which is fully compiler controlled
and guarantees in-order operand delivery. The dynamic network carries
traffic such as I/O, main memory accesses and inter-tile message
passing that is difficult to schedule fully at compile time. The static
router controls two crossbars, each with seven inputs: the four
neighboring tiles in the square array, the router pipeline itself,
the other crossbar and the processor. For tiles on the periphery of
the chip, some of the links connect to external interfaces. The tiles
and the static router are designed for single-cycle latency between
hops. The compiler encodes the routing decisions for the crossbars
into a 64-bit instruction that is fetched from a 64 KB instruction
cache and executed by the static router. Inter-tile communication
latency is reduced by integrating the network with the bypass paths
of the processor. A 225 MHz implementation of a 16-tile RAW processor
was fabricated in a CMOS process and achieved speedups
of 4.9 to 15.4 over a 600 MHz Pentium III for a set of stream-oriented
benchmarks written in the StreamIt language.
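
As a rough illustration of the tile array and crossbar inputs described above, the following Python sketch models a 4 x 4 mesh and counts the minimum number of hops between two tiles. It assumes row-major tile numbering, one cycle per hop, and no contention or injection overhead. The names `CrossbarInput`, `tile_coords` and `min_hops` are hypothetical and are not taken from the RAW toolchain.

```python
from enum import Enum


class CrossbarInput(Enum):
    """The seven inputs of one static-router crossbar (illustrative names)."""
    NORTH = 0
    EAST = 1
    SOUTH = 2
    WEST = 3
    ROUTER_PIPELINE = 4
    OTHER_CROSSBAR = 5
    PROCESSOR = 6


def tile_coords(tile, side=4):
    """Map a row-major tile index to (row, col) on a side x side array."""
    if not 0 <= tile < side * side:
        raise ValueError(f"tile {tile} is outside a {side}x{side} array")
    return divmod(tile, side)


def min_hops(src, dst, side=4):
    """Minimum number of mesh hops between two tiles (Manhattan distance)."""
    src_row, src_col = tile_coords(src, side)
    dst_row, dst_col = tile_coords(dst, side)
    return abs(src_row - dst_row) + abs(src_col - dst_col)


if __name__ == "__main__":
    # Opposite corners of a 16-tile array (tiles 0 and 15).
    print(min_hops(0, 15))
```

Under the one-cycle-per-hop assumption, the hop count is a lower bound on the number of cycles an operand spends in the static network between the two tiles.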
Because each tile runs an independent thread of execution, the RAW processor can perform time, space and space-time multiplexing. The StreamIt language mostly exposes a space-multiplexed programming model, even though the compiler can partition kernels and load-balance them for space-time multiplexing.
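
The three multiplexing modes can be illustrated with a small Python sketch that assigns kernels to tiles. It uses a generic greedy heuristic (heaviest kernel first, onto the least-loaded tile) and is not the StreamIt compiler's partitioning algorithm. The function name `assign_kernels`, the kernel names and the cost estimates are all hypothetical.

```python
import heapq


def assign_kernels(kernel_costs, num_tiles):
    """Greedily place each kernel on the currently least-loaded tile.

    kernel_costs maps a kernel name to an estimated amount of work.
    Returns a dict mapping tile id -> list of kernel names on that tile.
    """
    if num_tiles < 1:
        raise ValueError("need at least one tile")
    # Min-heap of (accumulated load, tile id).
    heap = [(0, tile) for tile in range(num_tiles)]
    heapq.heapify(heap)
    assignment = {tile: [] for tile in range(num_tiles)}
    # Heaviest kernels first; ties broken by name for determinism.
    ordered = sorted(kernel_costs.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, cost in ordered:
        load, tile = heapq.heappop(heap)
        assignment[tile].append(name)
        heapq.heappush(heap, (load + cost, tile))
    return assignment


if __name__ == "__main__":
    costs = {"a": 4, "b": 3, "c": 2, "d": 1}  # made-up work estimates
    print(assign_kernels(costs, 4))  # one kernel per tile: space multiplexing
    print(assign_kernels(costs, 1))  # all kernels share a tile: time multiplexing
    print(assign_kernels(costs, 2))  # kernels grouped per tile: space-time
```

With at least as many tiles as kernels, every kernel gets its own tile (space multiplexing). With a single tile, all kernels share it in turn (time multiplexing). Any tile count in between yields load-balanced groups of kernels per tile, which corresponds to space-time multiplexing.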