6.5.3 Bandwidth Savings
The Hub-4 speech model used in this study has 49,152 interleaved and padded Mean/Var vector pairs, each occupying three 128-byte L2 cache lines, or 384 bytes per pair. The total size of the Gaussian table is therefore 18 MB. Sphinx processes this table 100 times every second, but uses a subvector quantization heuristic to cut down the processing requirement, which in turn lowers DRAM bandwidth utilization. To guarantee real-time processing, the Gaussian accelerator may instead be used at low power for brute-force evaluation, which without blocking would mean streaming the full table 100 times per second, or about 1.8 GB/s. Because of the blocking optimization GAU OPT, the data needs to be processed only 10 times per second, for a peak bandwidth of 180 MB/s, which can be reduced further by applying the subvector quantization (nonfeedback) heuristics in Sphinx. Not only does this design bring the bandwidth requirement within limits achievable on embedded systems, it also drastically reduces power consumption. On a 400 MHz Intel XScale development system, where the processor itself consumes less than 1 W, a peak memory bandwidth of 64 MB/s was obtained, and achieving this bandwidth consumed an additional 0.47 W. The factor of four or more bandwidth savings is significant for the embedded space since it indicates that a 52-watt server can be replaced by a 1-watt embedded processor.
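The table size and bandwidth figures above follow from simple arithmetic, which the short Python sketch below reproduces. The constant and function names are illustrative. The block factor of 10 frames per pass is inferred from the stated drop from 100 to 10 table passes per second, and "MB" is taken to mean 2^20 bytes, the reading under which the quoted 18 MB is exact.

```python
# Back-of-the-envelope check of the Gaussian table size and DRAM bandwidth.
# All names are illustrative; the numeric inputs come from the text.

CACHE_LINE_BYTES = 128      # L2 cache line size
LINES_PER_PAIR = 3          # cache lines per padded Mean/Var pair
NUM_PAIRS = 49_152          # Mean/Var pairs in the Hub-4 model
FRAMES_PER_SECOND = 100     # table passes per second without blocking
BLOCK_FACTOR = 10           # assumption: frames evaluated per table pass

MIB = 1024 * 1024           # "MB" interpreted as 2**20 bytes


def table_bytes() -> int:
    """Total size of the interleaved Mean/Var table in bytes."""
    return NUM_PAIRS * LINES_PER_PAIR * CACHE_LINE_BYTES


def bandwidth_bytes_per_s(passes_per_second: int) -> int:
    """Peak DRAM traffic when the whole table is streamed each pass."""
    return table_bytes() * passes_per_second


unblocked = bandwidth_bytes_per_s(FRAMES_PER_SECOND)
blocked = bandwidth_bytes_per_s(FRAMES_PER_SECOND // BLOCK_FACTOR)

print(f"table size: {table_bytes() / MIB:.0f} MB")
print(f"unblocked:  {unblocked / MIB:.0f} MB/s")
print(f"blocked:    {blocked / MIB:.0f} MB/s")
```

Under these assumptions, the blocked figure matches the 180 MB/s peak quoted above, and the unblocked figure is ten times larger.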
The Gaussian coprocessor takes advantage of the simple loop structure and the limited precision requirements of the GAU algorithm to make real-time processing of speech signals possible at a greatly reduced power budget. However, its design is quite inflexible and difficult to adapt to other algorithms, such as neural-network evaluation, which involve similar loops and summation operations. This experience underscores the potential benefits of programmable accelerators, which can use domain-specific optimizations to provide power and performance advantages similar to those of ASICs.
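To make the "simple loop structure" concrete, the following minimal Python sketch shows one way a blocked Gaussian evaluation loop could look. It is not taken from Sphinx or from the coprocessor design. It assumes a diagonal-covariance Gaussian scored as a variance-weighted squared distance, omits normalization constants and mixture weights, and uses hypothetical names throughout (`gau_scores_blocked`, `frames`, `means`, `inv_vars`).

```python
# Illustrative blocked Gaussian scoring loop (assumed form, hypothetical names).
# Each Mean/Var pair is visited once per block of buffered frames, so the
# table is streamed once per block rather than once per frame.

def gau_scores_blocked(frames, means, inv_vars):
    """Return scores[g][f]: weighted squared distance of frame f to Gaussian g.

    frames   -- list of feature vectors buffered for this block
    means    -- one mean vector per Gaussian
    inv_vars -- one vector of (assumed precomputed) inverse variances per Gaussian
    """
    scores = []
    for mean, inv_var in zip(means, inv_vars):    # one table entry per iteration
        row = []
        for x in frames:                          # reuse the entry for every frame
            acc = 0.0
            for xi, mi, vi in zip(x, mean, inv_var):
                d = xi - mi
                acc += d * d * vi                 # multiply-accumulate inner loop
            row.append(acc)
        scores.append(row)
    return scores


# Usage: two buffered frames scored against a single two-dimensional Gaussian.
example = gau_scores_blocked(
    frames=[[1.0, 2.0], [0.0, 0.0]],
    means=[[0.0, 0.0]],
    inv_vars=[[1.0, 0.5]],
)
```

The inner multiply-accumulate pattern is the kind of loop and summation that neural-network evaluation also relies on, which is why a more programmable datapath could plausibly serve both.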