As natural human interfaces become more common, the scalability of servers that process speech will become an important issue, particularly for systems such as call centers and collaborative work environments. In addition to its energy advantages, the design is also scalable. Figure 6.3 shows that the system can process five independent speech channels in real time. The main limitation is the in-order processor with its simple blocking cache, which is evident from the difference in performance between the first and second bars in each data set. At six channels, the system falls slightly behind real time. However, with an ideal L1 D-cache, which always reports a cache hit and never writes data back to memory, the system scales to 10 channels or more.

A Final Sigma stage that implements log-domain addition enables the design to scale even with blocking caches, because it removes the destructive interference between the cache and the DMA engine: the stage reduces the number of results that must be stored in the cache by a factor of eight. With this optimization the system can process 10 or more channels of speech. For embedded designs, the power required to support multiple speech channels may be excessive, but such an organization is likely in a server. One channel of speech feature vectors contributes about 16 KB/s of memory bandwidth, and the outgoing probabilities consume 2.3 MB/s.
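The log-domain addition performed by the Final Sigma stage can be sketched as follows. The function names and the use of double-precision floats here are illustrative assumptions; a hardware stage would more likely use a fixed-point approximation of the correction term (for example, a small lookup table).

```python
import math

def log_add(log_a: float, log_b: float) -> float:
    """Return log(exp(log_a) + exp(log_b)) without leaving the log domain.

    Computed as max + log1p(exp(-|diff|)), which stays numerically stable
    even when exp(log_a) or exp(log_b) would underflow on its own.
    """
    if log_a < log_b:
        log_a, log_b = log_b, log_a
    return log_a + math.log1p(math.exp(log_b - log_a))

def log_sum(log_scores):
    """Accumulate several per-component log scores into one log-domain
    total, as a Final Sigma stage would when combining the Gaussian
    mixture components of a state."""
    total = log_scores[0]
    for s in log_scores[1:]:
        total = log_add(total, s)
    return total
```

Because the mixture-component scores of a state collapse into a single value before being written back, the cache sees one result per state instead of one per component, which is the source of the factor-of-eight reduction for eight-component mixtures.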
Setting a threshold on acceptable Gaussian scores and sending out only the scores that pass it significantly reduces this traffic, although the dominant bandwidth component remains the Gaussian table. To support more channels, additional Feat SRAMs and Gaussian accelerator datapaths may be included. Since the Gaussian tables are common to all channels, all datapaths can share the same Var and Mean SRAMs and thereby reuse the same 180 MB/s vector stream. With a higher-frequency implementation of the Gaussian datapath, multiple channels can also be multiplexed onto the same datapath. In a server, the Gaussian estimation for several channels can be delegated to a line card that operates out of its own 18 MB Gaussian ROM. The resulting partitioning of bandwidth, a roughly 50% reduction in server workload per channel, and reduced cache pollution together improve server scalability.
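The threshold-based pruning of outgoing scores can be sketched as follows. The function name, the (index, score) record format, and the example threshold are illustrative assumptions, not details from the design; the point is only that scores below the threshold never reach the bus.

```python
def prune_scores(scores, threshold):
    """Keep only Gaussian scores at or above the threshold, returning
    (state_index, score) pairs so the receiver knows which states
    survived pruning.

    Log-domain scores are negative; a higher (less negative) score is
    better, so the threshold discards unlikely states before they are
    sent out.
    """
    return [(i, s) for i, s in enumerate(scores) if s >= threshold]
```

Sending every score each frame is what produces the 2.3 MB/s outgoing-probability stream; pruning shrinks that stream roughly in proportion to the fraction of scores that pass the threshold, at the cost of transmitting an index alongside each surviving score.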