A basic understanding of the acoustic and language models is necessary to understand the architectural implications and scaling characteristics of speech recognition. The lexical tree is a complex data structure that results in considerable pointer chasing at run time. Which nodes are accessed depends heavily on the sentences being spoken, and the size of the tree depends on the vocabulary size. However, there is scope for architectural optimization. Acoustic vectors are evaluated successively, and if evaluating an HMM against the current vector yields a probability above a certain threshold, the successors of that HMM will be evaluated in the next time step. Thus there is always a list of currently active HMMs/lextree nodes and a list of nodes that will be active next. Evaluating each HMM takes a deterministic number of operations and thus a fixed number of clock cycles. This information can be used to prefetch nodes ahead of when they are evaluated.
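The active-list mechanism described above can be sketched as follows. This is only a minimal illustration under stated assumptions: `Node`, `step`, `score_fn` and `threshold` are hypothetical names, the score function stands in for a real HMM evaluation, and a real decoder would also keep surviving nodes active and prune by beam width.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """Hypothetical lextree node that holds one HMM."""
    name: str
    successors: list = field(default_factory=list)


def step(active, score_fn, threshold):
    """Evaluate every active node against the current acoustic vector.

    Returns the nodes that become active in the next time step, in the
    order in which they were discovered.
    """
    next_active = []
    seen = set()
    for node in active:
        # score_fn is a placeholder for evaluating the node's HMM.
        if score_fn(node) > threshold:
            for succ in node.successors:
                if id(succ) not in seen:  # avoid scheduling a node twice
                    seen.add(id(succ))
                    next_active.append(succ)
    # next_active is complete before the next frame begins, so it can
    # double as a prefetch schedule for the nodes evaluated next.
    return next_active
```

Because the next-active list is fully known one time step ahead, and each HMM evaluation has a fixed cost, a prefetcher could in principle walk that list at a predictable rate while the current list is being evaluated.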
Given that the numbers of triphones and words in a language
are relatively stable, it might appear that the workload will never
expand. In reality this is not the case, because of the probability density
function. In the past, speech recognizers used subvector
quantized models, which are easy to compute. These methods use a code
book to store reference acoustic vectors. Acoustic vectors obtained
from the front end are compared against the code book to find the
index of the closest match. The probability density function
then reduces to a table lookup keyed by that index. While this
is computationally efficient, the discretization of observation probability
leads to excessive quantization error and thereby poor recognition
accuracy.
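The code book lookup described above might look like the following sketch. It is a simplification under stated assumptions: a single code book covers the whole acoustic vector (subvector quantization would keep one code book per subvector), closeness is measured by squared Euclidean distance, and the names `nearest_codeword`, `discrete_observation_prob` and `state_table` are hypothetical.

```python
def nearest_codeword(vector, codebook):
    """Return the index of the code book entry closest to `vector`.

    Closeness is squared Euclidean distance (an assumption of this sketch).
    """
    best_index, best_dist = 0, float("inf")
    for k, ref in enumerate(codebook):
        dist = sum((x - r) ** 2 for x, r in zip(vector, ref))
        if dist < best_dist:
            best_index, best_dist = k, dist
    return best_index


def discrete_observation_prob(state_table, vector, codebook):
    """Observation probability of one HMM state as a plain table lookup."""
    return state_table[nearest_codeword(vector, codebook)]
```

Every acoustic vector that maps to the same code word receives the same probability, which is the source of the quantization error mentioned above.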
To obtain better accuracy, modern systems use a continuous probability density function, and the common choice is a multivariate mixture Gaussian, in which case the computation may be represented as:
$$b_j(Y_t) = \sum_{m=1}^{M} c_{jm} \sum_{n=1}^{N} \left(Y_t[n] - \mu_{jm}[n]\right)^2 \times V_{jm}[n]$$
Here, $b_j(Y_t)$ is the observation probability of HMM state $j$ for the acoustic vector $Y_t$,
$\mu_{jm}$ is the mean and $V_{jm}$ the variance of the Gaussian
mixture and $c_{jm}$ is the weight of the mixture. The Hub-4
speech database used for this research was obtained from CMU and they
chose the number of mixture components $M$ and the vector dimension $N$
to be 8 and 39 respectively. Note that the outer $\sum$
denotes an addition in the logarithmic domain. Normally
the inner term involves exponentiation to compute a weighted Mahalanobis-like
distance, but it is reduced to simple arithmetic operators by keeping
all the parameters in the logarithmic domain [91,111].
Therefore the outer summation needs to be done in the logarithmic
domain. This may be implemented using table lookup based extrapolation.
This strategy is troublesome if the processor's L1 D-cache is not
large enough to contain the lookup table.
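As an illustration of log-domain evaluation with a table-based addition, the sketch below evaluates a diagonal-covariance Gaussian mixture. It is one common way to arrange the arithmetic, not necessarily the factoring used by the recognizer discussed here. The table resolution `STEP`, the size `TABLE_SIZE`, the use of natural logarithms, and all function and parameter names are assumptions made for the example. The per-component constants (`log_norms`) and the scaled inverse variances (`inv_two_var`) are assumed to be precomputed offline.

```python
import math

STEP = 0.01        # assumed table resolution, in natural-log units
TABLE_SIZE = 2048  # assumed size; larger differences add a negligible term
# Entry i holds log(1 + exp(-i * STEP)).
LOG_ADD_TABLE = [math.log1p(math.exp(-i * STEP)) for i in range(TABLE_SIZE)]


def log_add(a, b):
    """Approximate log(exp(a) + exp(b)) for finite a and b via table lookup."""
    hi, lo = (a, b) if a >= b else (b, a)
    index = int((hi - lo) / STEP)
    if index >= TABLE_SIZE:
        return hi  # the smaller term no longer affects the result
    return hi + LOG_ADD_TABLE[index]


def log_mixture_density(y, log_weights, means, inv_two_var, log_norms):
    """Log of a diagonal-covariance Gaussian mixture density at vector y.

    inv_two_var[m][n] is 1 / (2 * variance) and log_norms[m] is the log of
    the Gaussian normalization constant of component m.
    """
    total = None
    for log_c, mu, ivar, log_k in zip(log_weights, means, inv_two_var, log_norms):
        # Inner sum: subtract, square, multiply, accumulate. No exponentiation.
        dist = sum((yn - mn) ** 2 * vn for yn, mn, vn in zip(y, mu, ivar))
        component = log_c + log_k - dist
        # Outer sum: an addition carried out in the logarithmic domain.
        total = component if total is None else log_add(total, component)
    return total
```

With these assumed sizes the table holds 2048 floating-point entries. A finer resolution reduces the approximation error but enlarges the table, which is the cache-capacity trade-off noted above.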
If each HMM state uses a separate probability density function, then the system is said to be fully continuous. Thus the peak workload for an English speech recognizer would correspond to the evaluation of about 60,000 probability density functions and HMMs, as well as an associated lextree traversal that is proportional to the number of words in the vocabulary. Fully continuous models are not popular for two reasons:
Though traditional speech recognizers couple the evaluation of HMMs and Gaussians tightly, in the interest of extracting greater levels of thread parallelism, it is possible to decouple HMM and Gaussian evaluation, an approach that will be further investigated in Chapter 5.