As CMOS technology scales, wire delays grow relative to logic
delays. The cluster interconnect reflects the belief that future architectures
will need to explicitly address communication at the ISA level. Traditional
architectures are based on implicit communication. For example, the
MIPS instruction
depends on the hardware
to keep track of the last location where the operand
was present
and transfer it to where it is consumed. The location could be a renamed
register or a pipeline stage. In a wide-issue clustered processor,
it is advantageous to source the operands of a function unit
from nearby function units to hide the effects of long wire delays.
This is possible if the compiler explicitly orchestrates
communication, and in the perception processor the compiler orchestrates
all of it. In the example above, the compiler would
pick a function unit to execute the
instruction, transfer
the output of the function unit that last produced the value corresponding
to the
operand to the
input of the chosen function unit,
transfer the constant
to the B input and schedule the actual
addition to happen in the cycle when both inputs are available. In the
perception processor, pipeline registers at the interfaces of every
unit, including function units and SRAM ports, are named and accessible
to software. Data is explicitly transferred from the output pipeline
register of a producer to the input registers of its consumers. Unlike
traditional architectures where pipelines shift under hardware control,
a compiler for the perception processor can use clock gating to control
pipeline shifting and thereby control the lifetime of values held
in pipeline registers. This ensures that a result stays live until
all its consumers have received a copy. This explicit management of
result lifetime and communication is called compiler controlled
data flow.
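
The mechanism described above can be illustrated with a small cycle-level sketch. It is a toy model under stated assumptions, not the perception processor's actual microcode format: the names `Adder`, `Control`, `step`, and `run_example`, the unit names `fu0` to `fu2`, and the operand values are all hypothetical. Each unit has named input registers `a` and `b` and an output register `out`. A control word either routes a source into an input register or enables the output register. A unit whose output is not enabled is treated as clock gated, so its result stays live for a later consumer.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# A source is ("unit", producer_name) or ("const", value). Illustrative encoding only.
Source = Tuple[str, object]


@dataclass
class Adder:
    """Hypothetical function unit with software-visible pipeline registers."""
    a: int = 0    # input pipeline register A
    b: int = 0    # input pipeline register B
    out: int = 0  # output pipeline register


@dataclass
class Control:
    """Per-unit control bits for one cycle (assumed layout)."""
    src_a: Optional[Source] = None  # None leaves input register A unchanged
    src_b: Optional[Source] = None  # None leaves input register B unchanged
    enable_out: bool = False        # False models a clock-gated output register


def read(units: Dict[str, Adder], src: Source) -> int:
    kind, value = src
    return units[value].out if kind == "unit" else value


def step(units: Dict[str, Adder], word: Dict[str, Control]) -> None:
    """Apply one microinstruction word; all reads see start-of-cycle state."""
    updates = {}
    for name, ctl in word.items():
        unit = units[name]
        new = {}
        if ctl.src_a is not None:
            new["a"] = read(units, ctl.src_a)
        if ctl.src_b is not None:
            new["b"] = read(units, ctl.src_b)
        if ctl.enable_out:
            new["out"] = unit.a + unit.b
        updates[name] = new
    # Commit all register writes together, as at a clock edge.
    for name, new in updates.items():
        for reg, value in new.items():
            setattr(units[name], reg, value)


def run_example() -> Dict[str, Adder]:
    # fu0 already holds a result (32) that two later additions consume.
    units = {"fu0": Adder(out=32), "fu1": Adder(), "fu2": Adder()}
    schedule: List[Dict[str, Control]] = [
        # Cycle 0: route fu0's result and the constant 10 to fu1's inputs.
        {"fu1": Control(src_a=("unit", "fu0"), src_b=("const", 10))},
        # Cycle 1: fu1 adds; fu0's gated output is still live for fu2.
        {"fu1": Control(enable_out=True),
         "fu2": Control(src_a=("unit", "fu0"), src_b=("const", 1))},
        # Cycle 2: fu2 adds.
        {"fu2": Control(enable_out=True)},
    ]
    for word in schedule:
        step(units, word)
    return units
```

In this sketch `fu0` never appears with `enable_out=True`, so its output register holds 32 across all three cycles and both consumers receive a copy. The schedule itself plays the role of the compiler's decision about where and when each transfer and each addition happens.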
Explicit communication makes it possible to overlap communication with computation with almost no hardware overhead. A significant number of bits in the wide microinstruction word are devoted to controlling the interconnect. While the interconnect can be controlled on a cycle-by-cycle basis, the compiler may elect to dedicate certain interconnect muxes to flows on a longer-term basis. For example, while adding two vectors, it can dedicate a separate interconnect mux to each of the two operands for the duration of the vector addition. The compiler also attempts operand isolation: it tries to set unused muxes to states that reduce the amount of activity visible to the rest of the circuitry, which lowers power consumption.
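
The vector-addition case and operand isolation can be sketched in the same toy style. The model is an assumption made for illustration: a mux is an index into a list of source values per cycle, sources 0 and 1 stand for two SRAM ports streaming the vectors, and source 2 stands for an idle unit whose output is held at 0. The function names and the choice of "output value changes" as the activity measure are hypothetical and are not taken from the perception processor's design.

```python
from typing import List, Tuple


def mux_output_toggles(selects: List[int], sources: List[List[int]]) -> int:
    """Count the cycles in which the value at the mux output changes.

    selects[t] is the selected source index in cycle t;
    sources[t] lists the value driven by each source in cycle t.
    """
    toggles = 0
    prev = None
    for sel, values in zip(selects, sources):
        cur = values[sel]
        if prev is not None and cur != prev:
            toggles += 1
        prev = cur
    return toggles


def vector_add_dedicated(x: List[int], y: List[int]) -> Tuple[List[List[int]], List[int]]:
    """Add two vectors with one mux dedicated to each operand stream."""
    # Per-cycle source values: [SRAM port 0, SRAM port 1, idle unit held at 0].
    cycles = [[xi, yi, 0] for xi, yi in zip(x, y)]
    n = len(cycles)
    sel_a = [0] * n  # A-input mux stays on port 0 for the whole vector
    sel_b = [1] * n  # B-input mux stays on port 1 for the whole vector
    sums = [vals[a] + vals[b] for vals, a, b in zip(cycles, sel_a, sel_b)]
    return cycles, sums


def unused_mux_activity(cycles: List[List[int]]) -> Tuple[int, int]:
    """Compare an unused mux left on a busy source with one parked on a quiet source."""
    n = len(cycles)
    left_on_busy_port = mux_output_toggles([0] * n, cycles)
    parked_on_idle_unit = mux_output_toggles([2] * n, cycles)
    return left_on_busy_port, parked_on_idle_unit
```

With `x = [1, 2, 3, 4]` and `y = [10, 20, 30, 40]`, the select values for the two operand muxes never change during the addition, which is the sense in which they are dedicated to a flow. An unused mux left pointing at port 0 passes every new element downstream, while one parked on the idle source shows no output changes in this model.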