6.1 Top Level Organization
Figure 6.1 illustrates the system context for the GAU accelerator, and Figure 6.2 shows the details of the accelerator itself. Loops 1, 2 and 3 of the optimized GAU algorithm in Figure 5.7 are implemented in hardware. The outer loop and the log_add step, which consists of an integer subtract, a table lookup and an integer add, are implemented in software. The max operation can be folded into the denormal floating point number handling section of the floating point adder without additional latency, but empirically it can be discarded without sacrificing recognition accuracy.

The organization in Figure 6.1 is essentially a decoupled access/execute architecture. The outer loop runs on a host processor and instructs a DMA engine to transfer the X, Mean and Var vectors into the accelerator's input memory. A set of 10 input blocks is transferred into the accelerator memory and retained for the duration of a pass over the entire interleaved Mean/Var table. The Mean/Var memory is double buffered so that the DMA engine and the accelerator can access it simultaneously. The accelerator sends its results to an output queue, from which the host processor reads them through its coprocessor access interface.
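To make the software side of this split concrete, the following is a minimal Python sketch of a log_add step built from an integer subtract, a table lookup and an integer add. It is an illustration under stated assumptions, not the implementation used here: the logarithm base `LOG_BASE`, the helper names `to_ilog` and `build_log_add_table`, the table layout, and the sign test used to pick the larger operand are all hypothetical and do not come from the text.

```python
import math

# Assumed base of the integer log domain; the text does not specify it.
LOG_BASE = 1.0003


def to_ilog(p: float) -> int:
    """Convert a probability to an integer logarithm in base LOG_BASE."""
    return round(math.log(p) / math.log(LOG_BASE))


def build_log_add_table() -> list:
    """Tabulate round(log_B(1 + B**-d)) for d = 0, 1, ... until it rounds to 0.

    The entries decrease monotonically with d, so the first zero marks the
    point beyond which the smaller operand no longer changes the sum.
    """
    table = []
    d = 0
    while True:
        entry = round(math.log1p(LOG_BASE ** -d) / math.log(LOG_BASE))
        if entry == 0:
            return table
        table.append(entry)
        d += 1


LOG_ADD_TABLE = build_log_add_table()


def log_add(a: int, b: int) -> int:
    """Return an integer approximation of log_B(B**a + B**b)."""
    d = a - b                      # integer subtract
    if d < 0:
        a, d = b, -d               # use the larger operand as the base value
    if d >= len(LOG_ADD_TABLE):
        return a                   # smaller operand is negligible
    return a + LOG_ADD_TABLE[d]    # table lookup, then integer add
```

For example, `log_add(to_ilog(0.3), to_ilog(0.2))` agrees with `to_ilog(0.5)` up to rounding error. Under these assumptions, the host would apply a function of this kind to the values it reads from the output queue while the accelerator continues with the next block.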