The first two algorithms called GAU and HMM described in Chapter 4 are dominant components of the Sphinx 3.2 speech recognizer. The next five algorithms named Rowley, Fleshtone, Erode, Dilate and Viola are components of the visual feature recognition system described in Chapter 7. The last three algorithms are FFT, FIR and Rijndael and these are taken from the DSP and encryption domains. The DSP algorithms were added to test the generality of our approach. DSP functions like FFT and FIR are important components of speech recognition front ends and image processing algorithms. Encryption is of increasing importance to secure embedded systems. Rowley, GAU, FFT and Fleshtone are floating point intensive. The remaining benchmarks are integer only computations. Some components of GAU, Rowley and Fleshtone may be vectorized while the rest of the algorithms cannot. HMM is intensive in data dependent branches which may be if-converted.
Several source level optimizations have been made to the software versions that run on the Pentium and XScale to boost their performance as much as possible [66]. The optimizations included hand unrolled loops, partial specialization of functions when some arguments are known statically, replacing expensive functions with table lookups, reshaping data structures for better cache locality and a variety of algorithm optimizations discussed in Chapters 5 and 7. No SIMD optimizations were made in order to keep the comparison fair. The perception processor could use SIMD floating point units, just like SSE on the Pentium, but widening datapaths makes isolating the impact of architectural options like compiler controlled dataflow impossible. A brief description of the benchmarks follow.
GAU and HMM represent Gaussian probability density evaluation and hidden Markov model evaluation respectively. GAU occupies 57.5% and HMM consumes 41.5% of the execution time of the Sphinx 3.2 speech recognition system. Both Gaussian distributions and hidden Markov models are components of most mature speech recognizers [59,111,91]. GAU computes how closely a 10 ms frame of speech matches a known Gaussian probability distribution. One input packet corresponds to evaluating a single acoustic model state over 10 frames of a speech signal. A real-time recognizer needs to process 600,000 invocations of the GAU algorithm every second. The HMM algorithm performs a Viterbi search over a hidden Markov model corresponding to one model state. One input packet to the HMM implementation consists of 32 five-state hidden Markov models. While the GAU algorithm is entirely floating point, the HMM algorithm is dominated by integer compare and select operations. Its average rate of invocation varies significantly with context, but to guarantee real-time performance it is assumed in this research that all HMM models are evaluated thereby brute forcing a large component of speech processing.
Rowley represents a neural network based visual feature detector
[83]. In the face recognizer a multilayer neural
network is swept over
rectangular regions of an image.
Each individual neuron is evaluated by the function:
.
Neurons have multiple sizes for their fan-in (
), and each layer
depends on the preceding layer's output. The software implementation
developed for this dissertation used hand unrolled, specialized versions
of neuron evaluation functions for each input size. Also,
was implemented via table lookup whereas Rowley's original implementation
used the
function in the C library. This optimization boosted
the Pentium's performance by a factor of 2.5. A
image
as well as the outputs of all the neurons are maintained within the
perception processor. Depending on the sizes of the neurons an input
packet consisting of the weights and connections of 7 to 64 neurons
is streamed through the perception processor. All computations involve
single precision floating point numbers.
Fleshtone represents a skin toning algorithm typically used
as a preprocessing step to find skin colored regions of an image so
that a more sophisticated object detector like the Rowley detector
may be applied to it. The benchmark converts RGB pixels to another
color space and checks if the projected pixel falls in between two
parabolic curves [90]. This algorithm represents a case
that is difficult to vectorize since there are far more floating point
operators per pixel than the number of FPUs present in the cluster.
This necessitates multiple passes and saving of intermediate results.
It also contains multiple if statements in the body. Each input packet
consists of a single raster line of a
24-bit color
image. The output is a 320-entry bitmap whose elements are set where
flesh color is found.
Erode and Dilate represent two operators from
mathematical morphology that help in image segmentation. Erode sweeps
a
pixel filter over the bitmap produced by Fleshtone and
cuts away weakly connected regions, i.e., it blacks out pixels if
all pixels within the filter are not set. Dilate does the opposite,
it sweeps a
pixel filter over a bitmap and fills in pixels
if any of the pixels are set. Fleshtone, Erode and Dilate are used
for image segmentation in a visual feature recognition system [65].
Erode works on three raster lines and dilate works on five raster
lines of a
image.
Viola is a reimplementation of the Viola and Jones' method
of object detection based on a well known machine learning algorithm
known as AdaBoost [103]. The algorithm relies on computing
features or wavelets which are the weighted sum or difference of rectangular
regions within a
window into an image. The coordinate
and weight information for 100 features are maintained within the
perception processor. Each input packet contains a
pixel
image. The output contains the evaluation of all 100 features over
the
image.
FFT implements a 128 point complex to complex Fourier transform on floating point data. The Fourier coefficients are maintained within the perception processor. Input and output packets consist of 128 complex numbers where each complex number consists of two single-precision floating point numbers. FFT represents a common algorithm for which many DSP processors implement ISA extensions. FFT also represents a case that causes bad interconnect conflicts on our architecture. Good performance depends on the interconnect borrowing technique described in Section 9.5. The software version on the Pentium is based on FFTW, a highly tuned FFT implementation which used dynamic programming techniques to adapt itself to the processor architecture [38]. The microcode implementation on the other hand uses a simple radix-2 algorithm and no ISA extensions. Since FFTW cannot be used on the XScale, the simple radix-2 algorithm is used instead.
FIR is a 32 tap finite impulse response filter, a common primitive in DSP applications. Impulse response coefficients are maintained inside the perception processor. Input packets of various sizes may be applied to the filter, which successively evaluates each input and outputs one integer corresponding to every input word.
Rijndael is the advanced encryption standard. The particular version implemented here uses 128 bit keys and works on 16 byte blocks [29]. Input blocks are 576 bytes long to simulate network level encryption. The default maximum size of Internet packets is 576 bytes. The key as well as the encryption S-boxes are maintained within the perception processor.