4.5 Architectural Implications
A basic understanding of the acoustic and language models is necessary to appreciate the architectural implications and scaling characteristics of speech recognition. The lexical tree is a complex data structure that results in considerable pointer chasing at run time. The nodes that are accessed depend heavily on the sentences being spoken, and the size of the tree depends on the vocabulary size. However, there is scope for architectural optimization. Acoustic vectors are evaluated successively, and when an HMM evaluated for the current vector generates a probability above a certain threshold, the successors of that HMM are evaluated in the next time step. Thus there is always a list of currently active HMMs/lextree nodes and a list of nodes that will be active next. Evaluating each HMM takes a deterministic number of operations and thus a fixed number of clock cycles. This information can be used to prefetch nodes ahead of when they are evaluated.
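The two-list scheme described above can be sketched as follows. This is a minimal illustration, not the recognizer's actual data structure: the `Node` class, the `step` function, and the `score_fn` callback are hypothetical names, and the sketch only propagates successors of nodes that pass the threshold, as the text describes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical lextree node that stands in for one HMM."""
    name: str
    successors: list = field(default_factory=list)


def step(active, score_fn, threshold):
    """Evaluate every active node for the current acoustic vector.

    Returns the list of nodes to evaluate at the next time step.
    The next list is fully built before it is consumed, so its
    entries are known in advance and could be prefetched.
    """
    next_active = []
    seen = set()
    for node in active:
        if score_fn(node) > threshold:
            for succ in node.successors:
                if id(succ) not in seen:  # avoid duplicate entries
                    seen.add(id(succ))
                    next_active.append(succ)
    return next_active


# Toy tree: a root with two children, with made-up scores.
leaf_a = Node("a")
leaf_b = Node("b")
root = Node("root", [leaf_a, leaf_b])
scores = {"root": 0.9, "a": 0.2, "b": 0.7}

upcoming = step([root], lambda n: scores[n.name], threshold=0.5)
print([n.name for n in upcoming])  # ['a', 'b']
```

Because each evaluation takes a fixed number of cycles, the position of a node in the next list determines roughly when it will be needed, which is what makes prefetching feasible.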
Given that the numbers of triphones and words in a language are relatively stable, it might appear that the workload will never expand. In reality this is not the case, because of the probability density function. In the past, speech recognizers used subvector quantized models, which are easy to compute. These methods use a code book to store reference acoustic vectors. Acoustic vectors obtained from the front end are compared against the code book to find the index of the closest match. The probability density function then reduces to a table lookup indexed by that code book entry. While this is computationally efficient, the discretization of the observation probability leads to excessive quantization error and thereby poor recognition accuracy.
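The code book approach can be illustrated with a small sketch. It is a simplification under stated assumptions: a single code book over the whole vector rather than separate subvector code books, squared Euclidean distance as the match criterion, and made-up table values. The names `closest_index` and `vq_probability` are hypothetical.

```python
def closest_index(codebook, vector):
    """Return the index of the reference vector nearest to `vector`."""
    def dist2(ref):
        return sum((a - b) ** 2 for a, b in zip(ref, vector))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))


def vq_probability(prob_table, state, codebook, vector):
    """Observation probability as a pure table lookup."""
    return prob_table[state][closest_index(codebook, vector)]


# Three reference vectors and a per-state probability table (made up).
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
prob_table = [
    [0.7, 0.2, 0.1],  # state 0
    [0.1, 0.3, 0.6],  # state 1
]

# Two different observations fall into the same code book cell,
# so they receive the same probability: the quantization error.
print(vq_probability(prob_table, 0, codebook, (0.9, 1.2)))  # 0.2
print(vq_probability(prob_table, 0, codebook, (1.4, 0.8)))  # 0.2
```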
To obtain better accuracy, modern systems use a continuous probability density function and the common choice is a multivariate mixture Gaussian in which case the computation may be represented as:
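One way to write this computation is sketched below. The notation is assumed for illustration and is not confirmed by the text: $b_j$ for the density of state $j$, $Y_t$ for the acoustic vector at time $t$, $c_{jm}$, $\mu_{jm}$ and $V_{jm}$ for the weight, mean and variance of component $m$, and $M$ and $N$ for the number of components and the vector dimension. The sketch also assumes diagonal covariances, so that each dimension contributes an independent term.

```latex
\log b_j(Y_t) \;=\; \bigoplus_{m=1}^{M} \left[ \log c_{jm} + K_{jm}
  - \sum_{n=1}^{N} \frac{\left(Y_t[n] - \mu_{jm}[n]\right)^2}{2\,V_{jm}[n]} \right],
\qquad a \oplus b = \log\!\left(e^{a} + e^{b}\right),
\qquad K_{jm} = -\tfrac{1}{2} \sum_{n=1}^{N} \log\!\left(2\pi V_{jm}[n]\right)
```

Since $\log c_{jm}$, $K_{jm}$ and $1/(2V_{jm}[n])$ can all be precomputed, the inner sum needs only subtractions, multiplications and additions at run time.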
The parameters of this computation are the mean and the variance of each Gaussian in the mixture and the weight of each mixture component. The Hub-4 speech database used for this research was obtained from CMU, whose developers chose the number of mixture components and the dimension of the acoustic vector to be 8 and 39 respectively. Note that the outer summation denotes an addition in the logarithmic domain. Normally the inner term involves exponentiation to compute a weighted Mahalanobis-like distance, but it is reduced to simple arithmetic operations by keeping all the parameters in the logarithmic domain [91,111]. The outer summation must therefore also be performed in the logarithmic domain, which may be implemented using table-lookup-based extrapolation. This strategy is troublesome if the processor's L1 D-cache is not large enough to hold the lookup table.
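A minimal sketch of this log-domain evaluation follows. Several choices here are assumptions for illustration only: diagonal covariances, floating-point logarithms, a nearest-entry table with resolution `STEP` and cutoff `MAX_DIFF`, and the function names `log_add`, `precompute` and `log_mixture_score`. The text does not specify the table layout used by the actual recognizer.

```python
import math

STEP = 0.01        # assumed table resolution
MAX_DIFF = 20.0    # beyond this difference the correction is treated as zero
N_ENTRIES = 2001   # covers differences 0.00, 0.01, ..., 20.00
TABLE = [math.log1p(math.exp(-i * STEP)) for i in range(N_ENTRIES)]


def log_add(a, b):
    """Approximate log(exp(a) + exp(b)) with a table lookup."""
    hi, lo = (a, b) if a >= b else (b, a)
    diff = hi - lo
    if diff >= MAX_DIFF:
        return hi
    index = min(int(round(diff / STEP)), N_ENTRIES - 1)
    return hi + TABLE[index]


def precompute(weights, variances):
    """Move the constant parameters into the logarithmic domain once."""
    log_weights = [math.log(w) for w in weights]
    inv_two_var = [[1.0 / (2.0 * v) for v in var] for var in variances]
    log_norms = [-0.5 * sum(math.log(2.0 * math.pi * v) for v in var)
                 for var in variances]
    return log_weights, inv_two_var, log_norms


def log_mixture_score(x, log_weights, means, inv_two_var, log_norms):
    """Log-domain mixture score: the inner loop is plain arithmetic."""
    total = None
    for lw, mu, iv, ln in zip(log_weights, means, inv_two_var, log_norms):
        dist = sum((xi - mi) ** 2 * vi for xi, mi, vi in zip(x, mu, iv))
        term = lw + ln - dist
        total = term if total is None else log_add(total, term)
    return total


# Toy mixture with 2 components over 2 dimensions (made-up values).
weights = [0.6, 0.4]
means = [[0.0, 0.0], [1.0, -1.0]]
variances = [[1.0, 1.0], [0.5, 2.0]]
x = [0.5, -1.0]

log_weights, inv_two_var, log_norms = precompute(weights, variances)
score = log_mixture_score(x, log_weights, means, inv_two_var, log_norms)
print(score)
```

Even this small table holds about two thousand entries, which hints at why the lookup table competes with other data for L1 D-cache space.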
If each HMM state uses a separate probability density function, then the system is said to be fully continuous. Thus the peak workload for an English speech recognizer would correspond to the evaluation of about 60,000 probability density functions and HMMs, as well as an associated lextree traversal that is proportional to the number of words in the vocabulary. Fully continuous models are not popular for two reasons:
- Their computational complexity makes them orders of magnitude slower than real time on current processors.
- Their parameters are difficult to estimate from sparse training sets, which leads to low recognition accuracy.
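The scale behind the first reason can be estimated with simple arithmetic. The sketch below takes the figure of 60,000 probability density functions from the text and the 8 mixture components and 39 dimensions reported for the Hub-4 models; the rate of 100 acoustic vectors per second is an assumption (a 10 ms frame shift), not a figure stated in this section.

```python
NUM_PDFS = 60_000         # peak number of density functions (from the text)
MIXTURES = 8              # Gaussians per mixture
DIMENSIONS = 39           # elements per acoustic vector
FRAMES_PER_SECOND = 100   # assumption: one acoustic vector every 10 ms

# Each inner term is one (difference, square, scale, accumulate) step.
inner_terms_per_frame = NUM_PDFS * MIXTURES * DIMENSIONS
inner_terms_per_second = inner_terms_per_frame * FRAMES_PER_SECOND

print(inner_terms_per_frame)   # 18720000
print(inner_terms_per_second)  # 1872000000
```

Under these assumptions, real-time operation would require close to two billion inner-term evaluations per second for the Gaussians alone, before any HMM evaluation or lextree traversal.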
Traditional speech recognizers tightly couple the evaluation of HMMs and Gaussians. In the interest of extracting greater levels of thread parallelism, however, it is possible to decouple HMM and Gaussian evaluation, an approach that is investigated further in Chapter 5.