2.3 High ILP Processors for Perception
The high-performance microprocessor industry has devoted considerable attention to short-vector (SIMD) extensions such as MMX, SSE, MDMX, and VIS that cater to the needs of multimedia applications [26,37]. An Intel publication described the use of SSE2 instructions for Viterbi decoding of hidden Markov models; significant performance improvement is claimed but not quantified. The Intel computer vision library provides SIMD-optimized versions of commonly used vision algorithms. Although vector machines have long been the workhorse of scientific computing, the relevance of short-vector SIMD optimizations to perception codes was not fully appreciated until recently. These techniques have been shown to improve performance by up to an order of magnitude on DSP-style algorithms and even on small speech processing codes. The general trend has been to use short vectors to exploit SIMD parallelism and to rely on the superscalar scheduling infrastructure already present in modern out-of-order processors to keep the SIMD units occupied, rather than employing true vector issue and long vectors. Shifting the task of identifying dependences and scheduling instructions from a vectorizing compiler to dynamic issue logic has the distinct disadvantage of increasing both processor complexity and power consumption. Vector chaining has traditionally been used as a performance enhancement mechanism. The compiler-controlled dataflow approach developed in this dissertation can mimic vector chaining in a more general manner and with low hardware overhead.
There have been numerous attempts to implement digital neural network processors as vector or SIMD machines; CNAPS from Adaptive Solutions and the NeuroMatrix DSP from Module Research Center are representative examples [44,72]. While neural network algorithms have been a mainstay of perception research, evaluations of such architectures on well-defined perception tasks or whole perception applications are rarely found in the literature. A well-known example is SPERT, a neural network and signal processing accelerator board for workstations, based on the Torrent-0 vector microprocessor jointly designed by the International Computer Science Institute and UC Berkeley. Evaluations of SPERT focused on training feed-forward neural networks by back-propagation for tasks such as probability estimation in a hidden Markov model based speech recognizer. Both processor speeds and the complexity of recognition tasks have increased greatly since the time of SPERT.
The performance of Multi-SPERT, a later design consisting of multiple SPERT boards, was measured at over 530 million connection updates per second for a five-node configuration performing neural network training for speech recognition. Moreto analyzed SPERT's performance on a partial implementation of RASTA-PLP, a speech front-end signal processing program. An implementation of RASTA for SPERT had significant impact in its day, yet a recent study reported that RASTA-PLP computation took only 6.7% of the run time of a recognition task. Clearly, the performance bottlenecks have shifted with advances in speech recognition technology.