5.1 Memory System Behavior
Figures 5.1 and 5.2 show the L1 Dcache and L2 cache miss rates for original, phased, FE, HMM, and GAU across a variety of configurations. Since earlier studies showed that larger line sizes benefit Sphinx II, 64-byte L1 and 128-byte L2 cache line sizes were chosen. In addition, the L2 cache experiments assume a 32 KB L1 Dcache. Both figures assume an 8 KB Icache. Since Sphinx has an extremely low instruction cache miss rate of 0.08% for an 8 KB Icache, no other Icache experiments were performed. The SGI data provide a reality check, since they represent results obtained from hardware performance counters. The SGI L2 results are very similar in character to the 8 MB simulation results despite the effects of out-of-order execution, memory system latency, and differences in cache replacement policy. The L1 results are not directly comparable, since the R12000 uses a 32-byte L1 line size and suffers from cache pollution induced by abundant DTLB misses.
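The benefit of larger line sizes for a streaming workload can be illustrated with a minimal cache model. The sketch below uses a fully associative LRU cache, a deliberate simplification (the simulated caches and the R12000's caches are set-associative), to show that a sequential pass with no reuse misses once per cache line, so doubling the line size halves the miss rate:

```python
from collections import OrderedDict

def miss_rate(trace, line_size, num_lines):
    """Miss rate of a fully associative LRU cache over an address trace.

    A hypothetical simplification for illustration; real L1/L2 caches
    are set-associative.
    """
    cache = OrderedDict()
    misses = 0
    for addr in trace:
        line = addr // line_size
        if line in cache:
            cache.move_to_end(line)        # refresh LRU position on a hit
        else:
            misses += 1
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict least recently used line
    return misses / len(trace)

# A streaming pass: word-sized (4-byte) sequential accesses, no reuse.
stream = range(0, 14 * 1024, 4)
print(miss_rate(stream, line_size=64, num_lines=512))   # 0.0625
print(miss_rate(stream, line_size=128, num_lines=512))  # 0.03125
```

Larger lines help here purely through spatial locality: each fetched line services more of the subsequent sequential accesses before the stream moves on.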
Figure 5.3 shows the average bandwidth required to process the workload in real time. This is obtained by dividing the total L2-to-memory traffic while Sphinx operates on a speech file by the duration in seconds of the speech signal. The evidence suggests that bandwidth starvation, leading to stalls on L2 misses, is the reason this application cannot meet real-time requirements. The memory bandwidth required by this application is several times higher than what is available in practice; note that available bandwidth is always significantly less than the theoretical peak on most architectures. A 16-fold increase in L2 size from 256 KB (the L2 size of a 1.7 GHz Athlon) to 8 MB (SGI Onyx) produces only a very small decrease in the bandwidth requirement of GAU. This phase essentially works in stream mode, making 100 sequential passes per second over a 14 MB Gaussian table. The speech signal itself contributes only 16 KB/s to the total bandwidth requirement. Some computation-saving heuristics in Sphinx also have the beneficial side effect of saving bandwidth by not touching blocks that are deemed improbable. Until the L2 size reaches 8 MB, long-term reuse of Gaussian table entries in the L2 is infrequent. It should be noted that the bandwidth requirement of GAU in isolation is more severe than when it operates inside original, since feedback-driven heuristics cannot be applied.
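The bandwidth figures above follow from simple arithmetic, which the sketch below reproduces under the stated numbers (taking 1 MB = 2^20 bytes is an assumption about the units used):

```python
def required_bandwidth(l2_traffic_bytes, speech_seconds):
    """Real-time bandwidth requirement: total L2-to-memory traffic
    divided by the duration of the speech signal."""
    return l2_traffic_bytes / speech_seconds

MB = 1 << 20  # assuming 1 MB = 2**20 bytes

# GAU streams over the 14 MB Gaussian table 100 times per second; with
# little long-term L2 reuse, nearly the whole table is refetched on
# every pass, so the traffic per second of speech is roughly:
gau_bandwidth = 100 * 14 * MB   # bytes per second of speech
print(gau_bandwidth / MB)       # 1400 MB/s

# The speech input itself is negligible by comparison:
speech_input = 16 * 1024        # 16 KB/s
print(gau_bandwidth / speech_input)
```

This back-of-the-envelope estimate makes clear why enlarging the L2 helps so little below 8 MB: a streaming pass over a table larger than the cache defeats reuse regardless of capacity.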