Chapter 4 introduced the use of multivariate mixture Gaussians in the acoustic model evaluation of Sphinx 3.2 and indicated that this computation is common to other speech recognition systems like HTK and the ICRC recognizer [59,111]. Chapter 5 showed that 55.5% of the execution time of Sphinx 3.2 was spent in Gaussian computation when using the Hub-4 speech model. The high percentage of execution time spent in this computation, together with its applicability to a variety of speech recognizers, argues for special acceleration hardware for mixture Gaussians. Accelerators may be implemented as custom nonprogrammable circuits or as domain-specific programmable processors. The custom circuit option represents a practical upper bound on achievable performance and energy efficiency. The programmable option, which sacrifices some performance and energy efficiency to gain generality, will be explored in Chapter 9. This chapter describes how a high throughput custom datapath is able to achieve area, power and bandwidth efficiency as well as scalability by means of:
Earlier work by Pihl explored the use of special-purpose floating
point formats in Gaussian estimation to save memory bandwidth [77].
Special floating point formats should be almost invisible to the application
so that speech models may be developed without access to any special
hardware. A custom software floating point emulation library was developed
to conduct an empirical search for the precision requirements of the
GAU phase. The library supported multiplication, addition, MAC, and
$(a-b)^2$ (subtract-square) operations on IEEE 754 format floating point numbers.
The approach was to experimentally reduce mantissa and exponent sizes
without changing the output results of the Sphinx 3 recognizer. The
result was a reduced precision floating point format similar to the
IEEE 754 single precision format: it has a sign bit, an 8-bit excess-127
exponent, and a hidden one bit in its normalized mantissa. Unlike IEEE 754,
which has 23 explicit bits in the mantissa, the new format used only
12 bits. Conversion between the reduced precision representation and
IEEE 754 was done by truncating the extra mantissa bits when converting
from IEEE 754 to the new format and concatenating additional 0 bits
when converting from the new format to IEEE 754. Such a transformation
can be done within a floating point unit without any changes being
visible to the application. Though this work was done independently,
it is worthwhile to note that a previous study arrived at similar
conclusions based on an earlier version of Sphinx [97].
However, that research used digit-serial multipliers, which cannot
provide the kind of throughput required for GAU computation. Hence
the accelerator discussed here uses fully pipelined reduced precision
multipliers instead.
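The conversion described above can be illustrated with a short sketch. It is
not the emulation library used in this work: the function names (`to_reduced`,
`from_reduced`, `quantize`) and the packed 21-bit layout are assumptions made
here for illustration, and Python is used only for brevity.

```python
import struct

MANTISSA_BITS = 12                  # explicit mantissa bits kept (from the text)
DROPPED_BITS = 23 - MANTISSA_BITS   # low-order IEEE 754 mantissa bits discarded


def float_to_bits(x: float) -> int:
    """Return the IEEE 754 single precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits: int) -> float:
    """Interpret a 32-bit pattern as an IEEE 754 single precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_reduced(x: float) -> int:
    """Truncate the extra mantissa bits: 1 sign + 8 exponent + 12 mantissa bits.

    The packed 21-bit integer is a hypothetical storage layout.
    """
    return float_to_bits(x) >> DROPPED_BITS


def from_reduced(r: int) -> float:
    """Concatenate zero bits to recover an IEEE 754 single precision value."""
    return bits_to_float(r << DROPPED_BITS)


def quantize(x: float) -> float:
    """Round trip through the reduced format, as a floating point unit might."""
    return from_reduced(to_reduced(x))


if __name__ == "__main__":
    for value in (1.0, -2.5, 3.14159):
        print(value, hex(to_reduced(value)), quantize(value))
```

Under these assumptions, a value such as 1.0 whose low mantissa bits are
already zero survives the round trip unchanged, while truncation changes any
other normalized single precision value by less than one part in 2^12.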
Another key insight is that current high performance microprocessors
provide a fused multiply-add operation that would benefit GAU. However,
GAU also needs an add-multiply (subtract-square) operation. There
is scope for floating point circuit improvements relying on the nature
of $(a-b)^2$ always returning a non-negative number. Further gains
can be obtained in area, latency, power and the magnitude of the numerical
error by fusing the operations $(a-b)^2 \cdot c$. This is the approach
used in this research.
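The effect of fusing on numerical error can be sketched in software. The
example below is an illustration only, not the datapath described in this
chapter. It assumes the fused operation has the form (a - b)^2 * c, that each
reduced precision operation truncates its result to 12 explicit mantissa bits,
and that double precision arithmetic stands in for the full-width intermediate
values. The names `truncate`, `unfused` and `fused` are hypothetical.

```python
import math

EXPLICIT_BITS = 12  # explicit mantissa bits of the reduced format (from the text)


def truncate(x: float) -> float:
    """Truncate x toward zero to a 13-bit significand (hidden bit + 12 bits).

    Assumption: exponent range limits and denormals are ignored.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                  # x = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << (EXPLICIT_BITS + 1)
    return math.ldexp(math.trunc(m * scale) / scale, e)


def unfused(a: float, b: float, c: float) -> float:
    """Three separate operations, each truncating its own result."""
    d = truncate(a - b)
    sq = truncate(d * d)
    return truncate(sq * c)


def fused(a: float, b: float, c: float) -> float:
    """One fused operation: a single truncation at the very end."""
    d = a - b
    return truncate(d * d * c)


if __name__ == "__main__":
    a, b, c = truncate(0.7314), truncate(0.2468), truncate(3.1416)
    d = a - b
    reference = d * d * c                 # double precision reference value
    print("unfused error:", reference - unfused(a, b, c))
    print("fused error:  ", reference - fused(a, b, c))
```

With truncation and a non-negative c, the fused result in this sketch is never
farther from the reference than the unfused one, because the two intermediate
truncations are avoided. The result is also never negative, so its sign is
known in advance.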