Equation 4.3 needs the quantity $P(Y|W)$, the probability of an
acoustic vector sequence $Y$ given a word sequence $W$, to find the
most probable word sequence. A simplistic approach would be to obtain
several samples of each possible word sequence, convert each sample to
the corresponding acoustic vector sequence, and compute a statistical
similarity metric between the given acoustic vector sequence $Y$ and
the set of known samples. For
large vocabulary speech recognition this is not feasible because the
set of possible word sequences is very large. Instead, words may be
represented as sequences of basic sounds. Knowing the statistical
correspondence between the basic sounds and acoustic vectors, the
required probability can be computed.
The basic sounds from which word pronunciations can be composed are known as phones or phonemes. Approximately 50 phones may be used to pronounce any word in the English language. For example, the CMU dictionary lists the pronunciation for dissertation as:
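A word-to-phone lookup of this kind can be sketched as follows. This is only a minimal illustration: the name `LEXICON`, the helper `words_to_phones`, and the two hand-typed entries (written in ARPAbet-style symbols with stress markers omitted) are assumptions made for the example and should be checked against the dictionary itself before being relied on.

```python
# Toy pronunciation lexicon: each word maps to a sequence of phone symbols.
# The entries are hand-typed for illustration (stress markers omitted) and
# are an assumption, not a copy of any dictionary file.
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "dissertation": ["D", "IH", "S", "ER", "T", "EY", "SH", "AH", "N"],
}


def words_to_phones(words):
    """Expand a word sequence into one flat phone sequence."""
    phones = []
    for word in words:
        key = word.lower()
        if key not in LEXICON:
            raise KeyError(f"no pronunciation listed for {word!r}")
        phones.extend(LEXICON[key])
    return phones


phones = words_to_phones(["speech", "dissertation"])
```

Because every word sequence expands into symbols drawn from the same small phone inventory, statistical models need to be trained only for the basic sounds rather than for every possible word sequence.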
The probability that an acoustic vector sequence corresponds to a
particular triphone may be estimated using a Hidden Markov Model (HMM).
Current speech recognizers use an HMM with three internal states
and an entry and an exit state. The topology of the HMM is shown in
Figure 4.2. An HMM is a probabilistic finite state
machine that generates observation sequences. If the model is in state
$i$ at time step $t$, then it has a probability $b_i(y_t)$ of producing
the acoustic vector $y_t$, and it switches to state $j$ with
probability $a_{ij}$. The problem of computing $P(Y|W)$
now becomes what is known as the evaluation problem for HMMs: the
problem of estimating the probability with which a given HMM could
have generated the observation sequence $Y$. The evaluation problem
can be solved using the Forward/Backward algorithm for HMMs, but since
the optimal state sequence is needed at a later stage, it is common
to do a more expensive Viterbi search which can compute the probability
and uncover the optimal state sequence simultaneously [80].
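
The difference between the two computations can be illustrated with a small sketch. It assumes a left-to-right model with three emitting states in which the non-emitting entry state always leads to the first emitting state and the exit state is reached from the last one with probability `exit_prob`. The function names, the parameter layout, and all numbers are hypothetical; in particular, the table `emit` stands in for the per-state output probabilities of four acoustic vectors, which a real recognizer would compute from its acoustic model, usually in the log domain to avoid underflow.

```python
def forward(trans, emit, exit_prob):
    """Evaluation: total probability, summed over all state sequences.

    trans[i][j] is the probability of moving from emitting state i to j.
    emit[t][j] is the probability of state j producing the t-th vector.
    """
    n = len(trans)
    # The entry state hands over to emitting state 0 with probability 1.
    alpha = [emit[0][j] if j == 0 else 0.0 for j in range(n)]
    for t in range(1, len(emit)):
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[t][j]
            for j in range(n)
        ]
    # The sequence must finish in the last emitting state and then exit.
    return alpha[-1] * exit_prob


def viterbi(trans, emit, exit_prob):
    """Decoding: probability of the single best state sequence, and the sequence."""
    n = len(trans)
    delta = [emit[0][j] if j == 0 else 0.0 for j in range(n)]
    backpointers = []
    for t in range(1, len(emit)):
        new_delta, pointers = [], []
        for j in range(n):
            # Best predecessor of state j: max replaces the sum used in forward().
            best = max(range(n), key=lambda i: delta[i] * trans[i][j])
            pointers.append(best)
            new_delta.append(delta[best] * trans[best][j] * emit[t][j])
        delta = new_delta
        backpointers.append(pointers)
    # Trace back from the last emitting state to recover the state sequence.
    path = [n - 1]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return delta[-1] * exit_prob, path


# Hypothetical model: the last state loops with 0.5 and exits with 0.5.
trans = [[0.6, 0.4, 0.0],
         [0.0, 0.7, 0.3],
         [0.0, 0.0, 0.5]]
emit = [[0.9, 0.1, 0.1],
        [0.5, 0.6, 0.1],
        [0.1, 0.7, 0.4],
        [0.1, 0.2, 0.8]]

p_total = forward(trans, emit, exit_prob=0.5)
p_best, best_path = viterbi(trans, emit, exit_prob=0.5)
```

In this sketch the two recursions differ only in replacing the sum over predecessor states by a maximum and recording which predecessor won. The Viterbi score is therefore never larger than the forward probability, and the stored back-pointers are what make the state sequence recoverable.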