The datapath shown in Figure 6.2 was implemented in a datapath
description language (Synopsys Module Compiler Language) and then
synthesized for a CMOS process. The control sections were written in
Verilog and synthesized using the Synopsys Design Compiler. The
gate-level netlist was then annotated with worst-case wire loads
calculated using the same wire load model used for synthesis. The
annotated netlist was simulated at the Spice level using Synopsys
Nanosim and transistor parameters extracted for the same MOSIS
process. Energy consumption was estimated from the RMS supply current
computed by Spice. The unoptimized, fully
pipelined design can operate above 300 MHz at the nominal voltage
of 2.5 volts with unit initiation interval. At this frequency the
performance exceeds the real-time requirements for GAU, so a lower
frequency and voltage can be used to further reduce power.
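The effect of lowering frequency and voltage can be illustrated with the standard first-order model of dynamic CMOS power, P proportional to C·V²·f. The sketch below is only an illustration under that model: it ignores leakage and the fact that the minimum usable voltage depends on the target frequency. The 202 MHz operating point is the real-time figure quoted later in this section; the 2.0 V supply is a hypothetical value chosen for the example and is not taken from the design.

```python
def relative_dynamic_power(f_mhz, v_volts, f_ref_mhz=300.0, v_ref_volts=2.5):
    """Dynamic power relative to the reference point, assuming P ~ C * V^2 * f.

    The reference defaults (300 MHz, 2.5 V) are the nominal values from the
    text. Switched capacitance is assumed unchanged between the two points.
    """
    return (f_mhz / f_ref_mhz) * (v_volts / v_ref_volts) ** 2


# Frequency scaling alone: run at the real-time rate, keep the nominal voltage.
freq_only = relative_dynamic_power(202.0, 2.5)

# Frequency plus voltage scaling. The 2.0 V value is HYPOTHETICAL; the source
# does not state which reduced voltage the circuit would tolerate.
freq_and_voltage = relative_dynamic_power(202.0, 2.0)

print(f"frequency only:        {freq_only:.3f} of nominal power")
print(f"frequency and voltage: {freq_and_voltage:.3f} of nominal power")
```

Because voltage enters quadratically, even a modest supply reduction contributes more than the frequency reduction alone, which is why the two are usually scaled together.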
A low power processor similar to a MIPS R4600 was designed for use as a control processor. The MIPS architecture was chosen because it is commonly used in embedded systems and because high performance implementations of the MIPS ISA, like the R12K, were readily available for experiments. The processor was designed so that it could be easily modified for tight integration with ASIC coprocessors. The Gaussian accelerator was designed and attached to the control processor as a custom coprocessor, and the combination was then simulated.
The control processor is a simple in-order design that uses a blocking L1 Dcache and has no L2 cache. To support the equivalent of multiple outstanding loads, it uses the MIPS coprocessor interface to submit DMA requests directly to a low priority queue in the on-chip memory controller. The queue supports 16 outstanding low priority block read requests with block sizes that are multiples of 128 bytes. A load request specifies a ROM address and a destination, which is one of the Feat, Mean or Var SRAMs. The memory controller initiates a queued memory read and transfers the data directly to the requested SRAM index. A more capable out-of-order processor could initiate the loads directly.
Software running on the processor core does the equivalent of the GAU OPT phase. Whenever the accelerator has finished processing the previous block of input, the software accumulates 10 frames (100 ms) of speech feature vectors, 1560 bytes in total, into the Feat SRAM. This data transfer uses the memory controller queue interface. Currently, the accelerator runs faster than its real-time requirement. It would be possible to slow the accelerator down so that it finishes each block just as the next block of input becomes ready, but this has not been attempted. Next, the software loads two interleaved Mean/Var vectors from ROM into the corresponding SRAM using the queue interface. A single transfer in this case is 640 bytes.
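The low priority queue interface described above can be modeled in a few lines. This is a minimal behavioral sketch, not the hardware interface: the class and field names (`LoadRequest`, `LowPriorityQueue`, `sram_index`, `size_bytes`) are hypothetical, and only the queue depth of 16, the 128-byte block granularity, and the three destination SRAMs come from the text.

```python
from collections import deque
from dataclasses import dataclass

BLOCK_BYTES = 128                       # block sizes are multiples of 128 bytes
QUEUE_DEPTH = 16                        # outstanding low priority read requests
DESTINATIONS = ("Feat", "Mean", "Var")  # destination SRAMs named in the text


@dataclass(frozen=True)
class LoadRequest:
    rom_address: int   # source address in ROM
    destination: str   # one of DESTINATIONS
    sram_index: int    # hypothetical: byte offset inside the destination SRAM
    size_bytes: int    # transfer size, a multiple of BLOCK_BYTES


class LowPriorityQueue:
    """Behavioral model of the memory controller's low priority read queue."""

    def __init__(self):
        self._pending = deque()

    def submit(self, request):
        """Queue a request. Returns False when all 16 slots are occupied."""
        if request.destination not in DESTINATIONS:
            raise ValueError(f"unknown destination {request.destination!r}")
        if request.size_bytes <= 0 or request.size_bytes % BLOCK_BYTES != 0:
            raise ValueError("size must be a positive multiple of 128 bytes")
        if len(self._pending) >= QUEUE_DEPTH:
            return False
        self._pending.append(request)
        return True

    def service_one(self, rom, srams):
        """Copy the oldest queued block from ROM into its destination SRAM."""
        if not self._pending:
            return None
        req = self._pending.popleft()
        data = rom[req.rom_address:req.rom_address + req.size_bytes]
        start = req.sram_index
        srams[req.destination][start:start + len(data)] = data
        return req
```

A 640-byte Mean/Var transfer is five 128-byte blocks and is accepted as is. The 1560-byte feature block is not a multiple of 128, so this model would reject a request of exactly that size; the source does not say how that transfer is sized or split.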
The Mean/Var SRAM is double buffered to hide the memory latency. Initially, the software fills both buffers. It then queues up a series of vector execute commands to the control logic of the Gaussian accelerator. A single command corresponds to executing the interchanged loops 2, 3, 1. The processor then proceeds to read results from the output queue of the Gaussian accelerator. When 10 results have been read, it is time to switch to the next Mean/Var vector and refill the used-up half of the Mean/Var SRAM. This process continues until the end of the Gaussian ROM is reached. Each time one cache line of results has been accumulated, it is written to the output queue, where another phase or an I/O interface can read it.
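The double-buffering schedule described above can be sketched as a small simulation that records the order of operations. This is an illustrative model only: the function name and the event tuples are hypothetical, each buffer half is assumed to hold one transfer's worth of Mean/Var data, and one execute command is issued per transfer rather than queuing a whole series up front as the real software does.

```python
RESULTS_PER_TRANSFER = 10  # 10 results are read before switching buffers


def schedule_block(num_transfers, results_per_transfer=RESULTS_PER_TRANSFER):
    """Return the ordered list of operations for one block of input.

    Events are tuples:
      ("load", transfer_id, half)     queue a DMA load into buffer half 0 or 1
      ("execute", transfer_id, half)  issue a vector execute command
      ("read", transfer_id, count)    read `count` results from the output queue
    """
    events = []
    next_to_load = 0

    # Initially, fill both halves of the Mean/Var SRAM.
    for half in (0, 1):
        if next_to_load < num_transfers:
            events.append(("load", next_to_load, half))
            next_to_load += 1

    for current in range(num_transfers):
        half = current % 2  # the halves alternate between successive transfers
        events.append(("execute", current, half))
        events.append(("read", current, results_per_transfer))

        # The half just consumed is free: refill it while the accelerator
        # works out of the other half, until the end of the ROM is reached.
        if next_to_load < num_transfers:
            events.append(("load", next_to_load, half))
            next_to_load += 1

    return events


for event in schedule_block(3):
    print(event)
```

The point of the schedule is that every refill targets the half that was just used, so the DMA latency overlaps with computation on the other half.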
Calculations based on the throughput of the accelerator showed that
it needed to operate at 202 MHz to achieve real-time speech processing.
To simplify the electrical interface between the processor and the
coprocessor, both circuits need to operate at the same clock frequency.
Since the processor runs a general purpose operating system, events
like clock ticks and background tasks sometimes interrupt the main
program that transfers data between main memory and the input and
output queues. Additional headroom is required so that these interruptions
do not prevent real-time processing of the speech data. The extra
performance required from the processor depends on the mix of control
tasks running on the processor. When the accelerator is scaled to
process multiple channels, the processor needs commensurate
processing ability as well. The operating frequency of the system was
therefore chosen to be as high as possible subject to the limitations of
the process. The maximum frequency at which the circuits were
stable was 300 MHz. A cycle-accurate simulator was developed and validated
by running it in lock step with the processor's HDL model. The simulator
was detailed enough to boot the SGI Linux 2.5 operating system and
run user applications in multitasking mode. The resulting system accurately
models the architecture depicted in Figures 6.1 and 6.2. The GAU OPT
application for this system is a simple 250-line C program with fewer
than 10 lines of assembly language for the coprocessor interface. Loop unrolling
and double buffering were done by hand in C. The application was compiled
using MIPS GCC 3.1 and run as a user application under Linux inside
the simulator. It processed each 100 ms block of a single channel
in 67.3 ms and scaled up to 10 channels in real time. The measured data
are presented in Section 6.5.2.
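The clock figures quoted in this section can be cross-checked with a short calculation. The sketch below assumes that the time to process a block scales inversely with clock frequency, which is a simplification introduced here for illustration and not a claim made in the text. Under that assumption, a design that needs 202 MHz to just meet real time would take about 67.3 ms per 100 ms block when clocked at 300 MHz, which roughly matches the measured single-channel figure.

```python
REQUIRED_MHZ = 202.0  # frequency needed for real-time operation (from the text)
SYSTEM_MHZ = 300.0    # chosen operating frequency (from the text)
BLOCK_MS = 100.0      # duration of speech in one block (from the text)
MEASURED_MS = 67.3    # reported single-channel processing time (from the text)


def block_time_ms(required_mhz, system_mhz, block_ms):
    """Time to process one block, assuming time scales as 1 / frequency."""
    return block_ms * required_mhz / system_mhz


predicted_ms = block_time_ms(REQUIRED_MHZ, SYSTEM_MHZ, BLOCK_MS)
headroom = SYSTEM_MHZ / REQUIRED_MHZ         # clock margin over real time
idle_fraction = 1.0 - MEASURED_MS / BLOCK_MS  # share of each block left over

print(f"predicted block time: {predicted_ms:.1f} ms")
print(f"clock headroom:       {headroom:.3f}x")
print(f"idle fraction:        {idle_fraction:.3f}")
```

The leftover fraction of each block is the headroom available for clock ticks, background tasks, and other control work on the processor.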