# 4.2 Acoustic Model

Equation 4.3 requires the quantity $P(Y \mid W)$, the probability of an acoustic vector sequence $Y$ given a word sequence $W$, in order to find the most probable word sequence. A simplistic approach would be to obtain several samples of each possible word sequence, convert each sample to the corresponding acoustic vector sequence, and compute a statistical similarity metric between the given acoustic vector sequence and the set of known samples. For large vocabulary speech recognition this is not feasible because the set of possible word sequences is very large. Instead, words may be represented as sequences of basic sounds. Once the statistical correspondence between the basic sounds and acoustic vectors is known, the required probability can be computed.

The basic sounds from which word pronunciations are composed are known as phones or phonemes. Approximately 50 phones suffice to pronounce any word in the English language. For example, the CMU dictionary lists the pronunciation of "dissertation" as:

- [DISSERTATION] D IH S ER T EY SH AH N
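
The dictionary lookup above, and the expansion of a phone sequence into the context-dependent triphones used in the next paragraph, can be sketched in a few lines of Python. This is only an illustration: the one-entry dictionary, the `SIL` boundary marker, and the function names `word_to_phones` and `phones_to_triphones` are assumptions made for the example, not part of any recognizer described here.

```python
# Hypothetical miniature pronunciation dictionary. The single entry is the
# CMU-style pronunciation quoted in the text.
PRONUNCIATIONS = {
    "DISSERTATION": ["D", "IH", "S", "ER", "T", "EY", "SH", "AH", "N"],
}

SILENCE = "SIL"  # assumed word-boundary marker, chosen for this sketch


def word_to_phones(word):
    """Look up the phone sequence for a word (case-insensitive)."""
    return PRONUNCIATIONS[word.upper()]


def phones_to_triphones(phones, boundary=SILENCE):
    """Pair each phone with its left and right neighbours.

    Returns one (left, phone, right) tuple per phone; the boundary marker
    pads both ends of the sequence.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]


if __name__ == "__main__":
    phones = word_to_phones("dissertation")
    for left, phone, right in phones_to_triphones(phones):
        print(f"{left}-{phone}+{right}")
```

Each word thus maps to as many triphones as it has phones, which keeps the number of distinct models bounded even though the number of possible word sequences is not.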

The probability that an acoustic vector sequence corresponds to a particular triphone, that is, a phone considered in the context of its left and right neighbors, may be estimated using a Hidden Markov Model (HMM). Current speech recognizers use an HMM with three internal states plus an entry state and an exit state. The topology of the HMM is shown in Figure 4.2. An HMM is a probabilistic finite state machine that generates observation sequences. If the model is in state $s_i$ at time step $t$, then it has a probability $b_i(y_t)$ of producing the acoustic vector $y_t$, and it switches to state $s_j$ with probability $a_{ij}$. The problem of computing the acoustic probability $P(Y \mid W)$ now becomes what is known as the evaluation problem for HMMs: estimating the probability with which a given HMM could have generated the observation sequence $Y = y_1, \ldots, y_T$. The evaluation problem can be solved using the Forward/Backward algorithm for HMMs, but since the optimal state sequence is needed at a later stage, it is common to do a more expensive Viterbi search, which computes the probability and uncovers the optimal state sequence simultaneously [80].
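
The difference between the two algorithms can be illustrated with a minimal Python sketch. It assumes a toy left-to-right HMM with three emitting states, discrete observation symbols, and made-up probabilities; none of these numbers come from a real recognizer, and real acoustic models score continuous acoustic vectors rather than symbols. The names `TRANS`, `EXIT`, `EMIT`, `forward`, and `viterbi` are hypothetical. `forward` sums over all state sequences, while `viterbi` keeps only the best one and records back-pointers so that the state sequence can be recovered.

```python
# Toy left-to-right HMM: three emitting states between a non-emitting entry
# state and exit state. All probabilities are invented for illustration.
N_STATES = 3

# TRANS[i][j]: probability of moving from emitting state i to emitting state j.
# Only self-loops and moves to the next state are allowed.
TRANS = [
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 0.5],
]
# EXIT[i]: probability of leaving emitting state i for the exit state.
EXIT = [0.0, 0.0, 0.5]

# EMIT[i][o]: probability that state i produces the discrete symbol o.
EMIT = [
    {"a": 0.8, "b": 0.1, "c": 0.1},
    {"a": 0.1, "b": 0.8, "c": 0.1},
    {"a": 0.1, "b": 0.1, "c": 0.8},
]


def forward(obs):
    """Probability that the HMM generates obs, summed over all state paths.

    obs must be non-empty; the entry state always leads to state 0.
    """
    alpha = [EMIT[0][obs[0]], 0.0, 0.0]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * TRANS[i][j] for i in range(N_STATES)) * EMIT[j][o]
            for j in range(N_STATES)
        ]
    return sum(alpha[i] * EXIT[i] for i in range(N_STATES))


def viterbi(obs):
    """Probability of the single best state path, and that path."""
    delta = [EMIT[0][obs[0]], 0.0, 0.0]
    backptrs = []  # backptrs[t][j]: best predecessor of state j at step t + 1
    for o in obs[1:]:
        new_delta, ptr = [], []
        for j in range(N_STATES):
            best_i = max(range(N_STATES), key=lambda i: delta[i] * TRANS[i][j])
            new_delta.append(delta[best_i] * TRANS[best_i][j] * EMIT[j][o])
            ptr.append(best_i)
        delta = new_delta
        backptrs.append(ptr)
    last = max(range(N_STATES), key=lambda i: delta[i] * EXIT[i])
    best_prob = delta[last] * EXIT[last]
    # Follow the back-pointers from the final state to recover the path.
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return best_prob, path


if __name__ == "__main__":
    observations = ["a", "a", "b", "c"]
    print("forward probability:", forward(observations))
    print("viterbi (probability, path):", viterbi(observations))
```

Because the Viterbi score covers only the single best state sequence, it can never exceed the forward probability; the two coincide when exactly one state sequence can produce the observations.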

Binu Mathew