Modern approaches to large vocabulary continuous speech recognition
are surprisingly similar in their high-level structure [111].
The work described herein is based on the CMU Sphinx 3.2 system, but
the general approach is applicable to other speech recognizers [49,74].
The explanation of large vocabulary continuous speech recognition
(LVCSR) in this chapter is based on a simple probabilistic model presented
in [80,111]. The human vocal apparatus has mechanical
limitations that prevent rapid changes to the sound generated by the vocal
tract. As a result, speech signals may be considered stationary over short
intervals, i.e., their spectral characteristics remain relatively unchanged
for several milliseconds at a time. Digital signal processing (DSP)
techniques may be used to summarize the
spectral characteristics of a speech signal into a sequence of acoustic
observation vectors. Typically, 100 such vectors are used to represent
one second of speech, i.e., one vector per 10 ms. Speech recognition then
becomes the statistical problem of finding the word sequence that is most
likely to correspond to the observed sequence of acoustic vectors. This
notion is captured by the equation: