Pihl at the Norwegian University of Science and Technology designed
the PDF coprocessor, a custom coprocessor in a 0.8 µm CMOS process,
to accelerate the computation of Gaussian observation probabilities
in a hidden Markov model-based speech recognizer [77]. This
research concluded that memory bandwidth was a limiting factor for
Gaussian computation. Pihl addressed the memory bandwidth problem
with a new fixed-point representation, the dynamical circular
fixed-point format, which halved the memory bandwidth requirement.
Using this format, the PDF coprocessor could evaluate 40,000
39-element Gaussian components in real time at 154 MHz while
consuming 853 mW of power. The work was based on an early version of
Sphinx. In the current version, Sphinx 3.2, the workload is larger by
a factor of 15.3. Both this factor and the bandwidth requirement are
expected to increase further in the future.
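
To make the bandwidth argument concrete, the following is a minimal sketch of the kind of computation involved, not Pihl's implementation. It assumes diagonal-covariance Gaussians, one evaluation of every component per frame, and a rate of 100 frames per second. None of these are stated above, and the function names and byte widths are hypothetical.

```python
import math


def diag_gaussian_log_prob(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance (assumed form)."""
    if not (len(x) == len(mean) == len(var)):
        raise ValueError("x, mean and var must have the same length")
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        # Each dimension contributes an independent 1-D Gaussian term.
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return log_p


def parameter_bandwidth(num_components, dim, bytes_per_param, frames_per_second):
    """Bytes per second needed to stream one mean and one variance
    per dimension for every component on every frame."""
    params_per_component = 2 * dim  # mean + variance per dimension
    return num_components * params_per_component * bytes_per_param * frames_per_second


# Hypothetical widths: 4-byte parameters versus a 2-byte fixed-point encoding.
wide = parameter_bandwidth(40000, 39, 4, 100)
narrow = parameter_bandwidth(40000, 39, 2, 100)
print(wide, narrow)
```

Under these assumptions, halving the storage width of each parameter halves the parameter traffic, which is the effect attributed to the fixed-point format above. The absolute figures depend entirely on the assumed frame rate and widths.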
An earlier attempt to accelerate speech recognition may be found in the work of Anantharaman and Bisiani [10]. They present both a custom architecture and a multiprocessor architecture for improving the performance of the beam search algorithm used by the CMU distributed speech recognition system.
Benedetti and Perona describe an FPGA-based system that exploits memory locality for real-time low-level vision [13]. Their system targeted fast prototyping of low-level vision techniques and used observations about locality in pixel neighborhoods to achieve 2.8 GBytes/second of bandwidth between SRAM components and FPGA-based compute elements.