The signal processing front end summarizes the spectral characteristics of the speech waveform into a sequence of acoustic vectors that are suitable for processing by the acoustic model. Figure 4.1 shows the stages of this transformation.
Frame Blocking: The digitized speech signal is blocked into overlapping frames. It is common to use 100 frames per second, so a new frame starts every 10 ms. Each frame covers its own 10 ms interval plus the last 7.5 ms of the preceding interval and the first 7.5 ms of the following one. Thus, even though a new frame starts every 10 ms, each frame is 25 ms in duration. The overlap reduces problems that might otherwise arise from discontinuities in the signal data.
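The frame arithmetic can be sketched in Python. This is a minimal illustration, not Sphinx code: the function name `block_into_frames` is hypothetical, the 16 kHz sampling rate is assumed as the default, and dropping a trailing partial frame is a simplification made here.

```python
def block_into_frames(samples, sample_rate=16000, frame_ms=25.0, shift_ms=10.0):
    """Split a sequence of samples into overlapping fixed-length frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))  # 400 samples at 16 kHz
    shift = int(round(sample_rate * shift_ms / 1000.0))      # 160 samples at 16 kHz
    frames = []
    start = 0
    # Keep only full-length frames; a trailing partial frame is dropped (assumption).
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += shift
    return frames


samples = list(range(16000))          # one second of dummy data at 16 kHz
frames = block_into_frames(samples)
```

At 16 kHz a 25 ms frame holds 400 samples and a new frame starts every 160 samples. Because this sketch keeps only complete frames, an isolated one-second clip yields slightly fewer than 100 frames; the rate of 100 frames per second holds for a continuous stream.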
Preemphasis: This stage spectrally flattens the frame using
a first order filter. The transformation may be described as:
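A common form of a first order preemphasis filter is y[n] = x[n] - a * x[n-1]. The sketch below assumes that form; the coefficient a = 0.97 is a widely used value chosen here for illustration and is not taken from the text, and the name `preemphasize` is hypothetical.

```python
def preemphasize(frame, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1] to one frame.

    alpha = 0.97 is an assumed, commonly used coefficient.
    """
    if not frame:
        return []
    out = [frame[0]]  # no previous sample exists for n = 0, so pass it through
    for n in range(1, len(frame)):
        out.append(frame[n] - alpha * frame[n - 1])
    return out
```

A constant input is almost cancelled by this filter while rapid sample-to-sample changes pass through, which is the sense in which it boosts the high frequencies relative to the low ones.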
Hamming Window: In this stage a Hamming window is applied to the frame to minimize the effect of discontinuities at the edges of the frame during the FFT. The transformation is:
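The standard Hamming window of length L is w[n] = 0.54 - 0.46 * cos(2 * pi * n / (L - 1)) for n = 0 .. L-1, and each sample of the frame is multiplied by the corresponding weight. A minimal sketch, assuming this standard definition (the function names are hypothetical):

```python
import math


def hamming_window(length):
    """Return the standard Hamming window weights for a frame of `length` samples."""
    if length == 1:
        return [1.0]  # avoid division by zero for a one-sample frame
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def apply_window(frame):
    """Multiply each sample by its window weight, tapering the frame edges."""
    weights = hamming_window(len(frame))
    return [x * w for x, w in zip(frame, weights)]
```

The weights fall to 0.08 at both ends of the frame and reach 1.0 in the middle, so the samples near the frame boundaries contribute little to the spectrum.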
FFT: The frame is padded with enough zeroes to make the frame
size a power of two, and a Fourier transform is used
to convert the frame from the time domain to the frequency domain.
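The padding and transform can be illustrated with a small pure-Python radix-2 FFT. This is a sketch for clarity rather than speed. Returning the squared magnitude (power) of each bin is an assumption about what the next stage consumes, and all function names are hypothetical.

```python
import cmath


def next_power_of_two(n):
    """Smallest power of two that is >= n (returns 1 for n <= 1)."""
    p = 1
    while p < n:
        p *= 2
    return p


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def power_spectrum(frame):
    """Zero-pad the frame to a power of two and return |X[k]|^2 for k = 0 .. N/2."""
    n = next_power_of_two(len(frame))
    padded = list(frame) + [0.0] * (n - len(frame))
    spectrum = fft(padded)
    # For real input the bins above N/2 mirror the lower ones, so they are dropped.
    return [abs(c) ** 2 for c in spectrum[: n // 2 + 1]]
```

A 400-sample frame (25 ms at 16 kHz) is padded to 512 samples, which gives 257 distinct frequency bins.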
Mel Filter Bank: A bank of triangular filters is used
to approximate the frequency resolution of the human ear. The Mel
frequency scale is linear up to 1000 Hz and logarithmic thereafter.
The overlapping Mel filters are constructed so that their center frequencies
are equidistant on the Mel scale. The transformation is:
For a 16 kHz sampling rate, Sphinx uses a set of 40 Mel filters.
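A triangular Mel filter bank can be sketched as follows. Several details here are assumptions rather than facts from the text: the Mel mapping mel(f) = 2595 * log10(1 + f / 700) is one widely used smooth approximation of the "linear then logarithmic" scale, the band edges of 0 Hz and half the sampling rate are illustrative defaults, the FFT size of 512 is assumed, and the filters are left unnormalized. Function names are hypothetical.

```python
import math


def hz_to_mel(f):
    """Assumed Mel mapping: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filter_bank(num_filters=40, fft_size=512, sample_rate=16000,
                    low_hz=0.0, high_hz=None):
    """Return num_filters rows of triangular weights over the FFT bins 0 .. fft_size/2."""
    if high_hz is None:
        high_hz = sample_rate / 2.0
    num_bins = fft_size // 2 + 1
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    # num_filters + 2 edge frequencies, equally spaced on the Mel scale.
    edges = [mel_to_hz(lo + (hi - lo) * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    bank = []
    for m in range(1, num_filters + 1):
        left, center, right = edges[m - 1], edges[m], edges[m + 1]
        weights = []
        for k in range(num_bins):
            f = k * sample_rate / fft_size  # frequency of FFT bin k in Hz
            if left < f <= center:
                weights.append((f - left) / (center - left))    # rising edge
            elif center < f < right:
                weights.append((right - f) / (right - center))  # falling edge
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def apply_filter_bank(power_spectrum, bank):
    """Each filter output is the weighted sum of the power spectrum under its triangle."""
    return [sum(w * p for w, p in zip(weights, power_spectrum)) for weights in bank]
```

Each filter starts at the center frequency of its left neighbor and ends at the center of its right neighbor, so adjacent triangles overlap by half, and the filters grow wider in Hz as the frequency rises.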
Log Compression: The range of the values generated by the
Mel filter bank is reduced by replacing each value by its natural
logarithm. This is done to make the statistical distribution of the
spectrum approximately Gaussian - a requirement for the subsequent
acoustic model. The transformation is:
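Log compression is an element-wise natural logarithm of the filter bank outputs. In the sketch below, the small floor that guards against taking the logarithm of zero is an assumption added for numerical safety, and the name `log_compress` is hypothetical.

```python
import math


def log_compress(filter_outputs, floor=1e-10):
    """Replace each filter bank output by its natural logarithm.

    Values below `floor` are clamped first so that log(0) never occurs
    (the floor is an assumed safeguard, not part of the description above).
    """
    return [math.log(max(v, floor)) for v in filter_outputs]
```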
DCT: The discrete cosine transform is used to compress the
spectral information into a set of low order coefficients. This representation
is called the Mel-cepstrum. Currently Sphinx compresses the 40 element
vector into a 13 element cepstral vector. The transformation
is:
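One standard choice for this step is the type-II DCT, c[k] = sum over j of x[j] * cos(pi * k * (j + 0.5) / M), where M is the number of log filter outputs and only the first 13 coefficients are kept. The sketch below assumes this unnormalized form; the exact scaling used by Sphinx may differ, and the name `mel_cepstrum` is hypothetical.

```python
import math


def mel_cepstrum(log_energies, num_ceps=13):
    """Unnormalized type-II DCT, truncated to the first num_ceps coefficients."""
    m = len(log_energies)
    return [
        sum(x * math.cos(math.pi * k * (j + 0.5) / m)
            for j, x in enumerate(log_energies))
        for k in range(num_ceps)
    ]
```

Coefficient 0 is the sum of the log energies and reflects overall level, while the higher coefficients describe progressively finer detail of the spectral shape. Keeping only the low order ones is what compresses 40 values into 13.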
Numerical differentiation: Acoustic modeling assumes that each acoustic vector is uncorrelated with its predecessors and successors. Since speech signals are continuous, this assumption is problematic. The traditional solution is to augment the cepstral vector with its first and second differentials. Since the Mel cepstral vector is 13 elements long in Sphinx, appending the differentials yields a final acoustic vector that is 39 elements in length.
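The augmentation can be sketched with a simple central difference between neighboring frames. The one-frame span and the clamping at the first and last frames are assumptions made for illustration; the difference spans actually used by Sphinx may differ. Function names are hypothetical.

```python
def central_difference(vectors):
    """d[t] = v[t+1] - v[t-1], clamping the indices at the first and last frame."""
    last = len(vectors) - 1
    out = []
    for t in range(len(vectors)):
        nxt = vectors[min(t + 1, last)]
        prev = vectors[max(t - 1, 0)]
        out.append([a - b for a, b in zip(nxt, prev)])
    return out


def add_differentials(cepstra):
    """Append first and second differentials to each cepstral vector."""
    delta = central_difference(cepstra)
    delta2 = central_difference(delta)  # second differential: difference of the deltas
    return [list(c) + d + dd for c, d, dd in zip(cepstra, delta, delta2)]
```

With 13 cepstral coefficients per frame, each output vector has 13 + 13 + 13 = 39 elements.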
Summary: The Sphinx front end transforms a 25 ms frame of speech into a 39 element vector of real numbers that represents the spectral characteristics of the waveform in a compact form. The speech signal is blocked into overlapping frames spaced 10 ms apart. Thus the front end transforms one second of speech into a series of 100 acoustic vectors. Even though the front end occupies less than 1% of the compute cycles of Sphinx 3.2, it is very important for two reasons.