HMMs are constructed for all known triphones. A pronunciation dictionary
is used to convert words into triphone sequences with overlapping
contexts. For example, the isolated word dissertation, whose pronunciation
is the phone sequence D IH S ER T EY SH AH N, is expanded to SIL-D+IH,
D-IH+S, IH-S+ER, S-ER+T, ER-T+EY, T-EY+SH, EY-SH+AH, SH-AH+N, AH-N+SIL. There
are many more expansions corresponding to all words that could possibly
precede or succeed this word in a sentence. In these expansions the SIL
context is replaced by a phone of the neighboring word, giving triphones
that end in D+IH or start with AH-N. A data structure known as a lexical
tree (Sphinx terminology) is constructed, and all words in the dictionary
are entered in the lexical tree. The roots of the tree correspond
to the set of all triphones that start any word in the dictionary.
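
The expansion of a pronunciation into triphones, and the entry of words into a tree keyed by triphone, can be sketched as follows. This is a minimal illustration, not Sphinx's actual data structure: the names `expand_to_triphones`, `TreeNode`, and `build_lexical_tree` are hypothetical, the two-word dictionary is made up for the example, and only the isolated-word (SIL) contexts are generated, so cross-word expansions are left out.

```python
from dataclasses import dataclass, field


def expand_to_triphones(phones, left="SIL", right="SIL"):
    """Expand a phone list into LEFT-PHONE+RIGHT triphone labels."""
    padded = [left, *phones, right]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


@dataclass
class TreeNode:
    """One node of the illustrative lexical tree (hypothetical layout)."""
    triphone: str
    children: dict = field(default_factory=dict)    # next triphone label -> TreeNode
    exit_words: list = field(default_factory=list)  # words whose last triphone is this node


def build_lexical_tree(dictionary):
    """Enter every word into a multi-rooted tree, sharing common prefixes."""
    roots = {}
    for word, phones in dictionary.items():
        level, node = roots, None
        for label in expand_to_triphones(phones):
            # Reuse an existing node when the triphone prefix is already present.
            node = level.setdefault(label, TreeNode(label))
            level = node.children
        if node is not None:
            node.exit_words.append(word)  # mark the word exit triphone
    return roots


# Illustrative dictionary; the pronunciations are assumptions for this example.
DICTIONARY = {
    "dissertation": "D IH S ER T EY SH AH N".split(),
    "dish": "D IH SH".split(),
}

roots = build_lexical_tree(DICTIONARY)
print(sorted(roots))
print(sorted(roots["SIL-D+IH"].children))
```

Both words start with SIL-D+IH, so the tree has a single root that they share, and the paths diverge at the second triphone (D-IH+S versus D-IH+SH).
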
Each node in the tree points to the next triphone in the expanded
pronunciation of a word. Common triphone sequences may be shared within
the tree. The overall effect is that of combining all the triphone
HMMs by adding null transitions from the final state of each triphone
HMM to the initial state of its successor. To model continuous speech,
null transitions are added from the final state of each word to the
initial state of all words. Triphones that occur at the end of a word
are specially marked so that a language model may be consulted at
those points. Thus the lexical tree is a multirooted tree where each
node points to an HMM and a successor node. In the case of word exit
triphones there are multiple successors. Given an acoustic vector
sequence, each vector in the sequence is applied successively
to the HMMs, and the probability that each HMM generated that vector
is noted. Transitions are made in each step to successor nodes. On
reaching a word exit triphone, the state sequence history is consulted
to find the word that has been recognized. The last n words (usually
n=3) are checked against a language model for further analysis. The
search uses a well-known dynamic programming algorithm,
Viterbi beam search [74]. The acoustic and language
models are strongly coupled, though language model evaluation may
be deferred until the acoustic model has been evaluated. Together,
they consume almost 99% of the run time of Sphinx.
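
The idea behind Viterbi beam search can be sketched on a toy HMM. This is a minimal illustration under stated assumptions, not the Sphinx decoder: observations are discrete symbols rather than acoustic vectors, there is no lexical tree or language model, and the names `viterbi_beam`, `prune`, and the two-state model below are hypothetical. Scores are kept in the log domain, and after each observation any state whose score falls more than `beam` below the best score is dropped.

```python
import math


def prune(active, beam):
    """Keep only states whose log score is within `beam` of the best one."""
    best = max(score for score, _ in active.values())
    return {s: v for s, v in active.items() if v[0] >= best - beam}


def viterbi_beam(obs, log_init, log_trans, log_emit, beam=10.0):
    """Return (log score, state path) of the best surviving hypothesis."""
    # active maps state -> (log score of best path ending there, that path)
    active = {s: (lp + log_emit[s][obs[0]], [s]) for s, lp in log_init.items()}
    active = prune(active, beam)
    for symbol in obs[1:]:
        nxt = {}
        for s, (score, path) in active.items():
            for t, lp in log_trans.get(s, {}).items():
                cand = score + lp + log_emit[t][symbol]
                # Viterbi step: keep only the best predecessor for each state.
                if t not in nxt or cand > nxt[t][0]:
                    nxt[t] = (cand, path + [t])
        if not nxt:
            raise ValueError("all hypotheses were pruned or had no successors")
        active = prune(nxt, beam)
    return max(active.values(), key=lambda v: v[0])


def to_log(table):
    """Convert a nested probability table to log probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


# Toy two-state model; all numbers are made up for illustration.
LOG_INIT = {"A": math.log(0.6), "B": math.log(0.4)}
LOG_TRANS = to_log({"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}})
LOG_EMIT = to_log({"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}})

score, path = viterbi_beam(list("xxy"), LOG_INIT, LOG_TRANS, LOG_EMIT, beam=10.0)
print(path, math.exp(score))
```

A wide beam reproduces exact Viterbi decoding. A narrow beam evaluates fewer states per step at the risk of pruning the hypothesis that would eventually have won, which is the usual trade-off between run time and search errors.
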