5.3.1 Cache Optimizations
In Section 5.1, GAU was shown to be bandwidth starved. The GAU code in phased was instrumented and found to require approximately twice the amount of computation as in original. However, Figure 5.6 shows that phased suffers only 0.85 times slow down over original on an R12000. Clearly, a large fraction of the excess computation is hidden by memory latency. With processor to memory speed ratios increasing in the future, an out of order processor can hide an even larger amount of compute overhead. The key is to improve the memory system behavior without an unreasonable increase in compute requirements.
To achieve this goal, two transformations were performed on phased. First, a blocking optimization similar in spirit to loop tiling was performed, which delays the initial speech signal by 100 ms or 10 frames. The Gaussian probabilities for all 10 frames are computed by making a single pass over the Gaussian tables. This effectively reduces the number of passes to 10 per second where original would have done 100. The blocking factor is limited to 10 to avoid a perceptible real-time lag at the decoder output.
It should be noted that this is not a blocking or tiling transformation that a compiler could perform. The software had to be restructured to accumulate 10 frames of the speech signal and to process 10 frames in one pass. Further, this became possible only because the feedback between HMM and GAU was eliminated. Speech researchers advancing the state of their art are unlikely to be interested in or aware of architectural level implications. Thus, it is imperative that architecture researchers analyze the performance implications of important perception applications like speech recognition.
Sphinx allocates the mean and variance vectors used for Gaussian computation described in Section 4.5 separately. Every component evaluation consumes one mean and one variance vector. Since Sphinx originally allocated each table of vectors separately and each is more than 7 MB, they potentially conflict with each other in the cache. To avoid this, corresponding mean and variance vectors were interleaved and padded with an additional 64 bytes to be exactly three L2 cache lines long. This padding strategy consumes bandwidth but simplifies DMA transfers for the coprocessor architecture described later. The optimized version appears in Figure 5.7. Note the interleaving of vectors and a blocking loop that is not present in Equation 4.4. The optimized version appears in Figures 5.1, 5.2, 5.3 and 5.6 as the data point GAU OPT.
GAU OPT demonstrates the true streaming nature of GAU. Figure 5.3 shows that GAU OPT uses a factor of 4.7 to 3.9 less bandwidth than GAU in simulation with a factor of 4.2 improvement obtained on a real machine. This supports the claim that GAU processing can be done without an L2 cache. With a 256 KB L2 cache, the GAU OPT bandwidth is 174 MB/s. Calculations show that without a heuristic, and without an L2 cache, GAU OPT can meet its real-time requirements with 180 MB/s of main memory bandwidth. This has important implications for the scalability of servers that process speech.
Figures 5.1 and 5.2 show dramatic reduction in the cache miss rates in both simulation and native execution. The L2 native execution results are better than simulation results. The large variation in the L1 results is due to the 32 byte L1 line size on the R12000 and also possibly because of an extremely large number of TLB misses. The software TLB miss handler could easily pollute the L1 cache. The important point is that Figure 5.6 shows that OPT, a version of phased with the GAU OPT blocking optimization, achieves a slight speedup over original despite performing a larger number of computations. In summary, to be able to extract parallelism, the feedback loop was broken, which approximately doubled the GAU workload. With cache optimizations (which are not possible with feedback), the loss due to the extra GAU workload is recovered and the exposed parallelism is now open for further optimization.