Figure 10.3 shows the throughput of the perception processor, the Pentium 4, and the XScale, as well as that of the ASIC implementations. Throughput is defined as the number of input packets processed per second, and the results in Figure 10.3 are normalized to the throughput of the Pentium 4. The perception processor operating at 1 GHz outperforms the 2.4 GHz Pentium 4 by a factor of 1.75 (geometric mean). The perception processor's mean throughput is 41.4% of that of the ASIC implementations (GAU, Rowley, FIR, Rijndael). This figure is severely skewed by the fact that the ASIC implementations, particularly Rijndael, expend vastly more hardware resources than the perception processor. This is evident from Figure 10.2, which shows that in the case of Rijndael the ASIC consumes more than twice the power of the perception processor. For the set GAU, Rowley and FIR, the perception processor in fact achieves on average 84.6% of the throughput of the ASIC implementation. These results demonstrate the benefit of the perception architecture for the problems posed by perceptual algorithms.
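The normalization and geometric-mean summary described above can be sketched as follows. This is a minimal illustration rather than the measurement scripts used for this work: the function names are hypothetical, and the benchmark names and throughput values are placeholders, not measured results.

```python
import math

def normalize(throughputs, baseline):
    """Divide each raw throughput (packets/s) by the baseline's throughput."""
    return {name: value / baseline for name, value in throughputs.items()}

def geometric_mean(values):
    """Geometric mean of positive ratios, computed in the log domain."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Placeholder speedups relative to a baseline processor (NOT measured data).
example_speedups = {"bench_a": 2.0, "bench_b": 0.5, "bench_c": 4.0}
summary = geometric_mean(example_speedups.values())
```

The geometric mean is the usual choice for averaging normalized ratios because the result does not depend on which machine is picked as the baseline, which is not true of the arithmetic mean.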
Two of the benchmarks demand further explanation. FFT is the only benchmark on which the Pentium outperforms the perception processor. This is because the version of FFT used on the Pentium is based on FFTW, one of the fastest FFT libraries in existence. FFTW uses a mixture of processor-specific measurements and dynamic programming optimizations to adapt itself to the specific system it runs on. The perception processor, on the other hand, uses a simple radix-2 algorithm, as does the XScale implementation. The reason is that FFTW is implemented as a large C library and is difficult to reimplement manually in microcode without the aid of a C compiler that targets the perception processor. The XScale lacks the floating point hardware to support FFTW. The radix-2 algorithm is not particularly well suited to the perception processor, since it causes interconnect conflicts that force a high initiation interval for the main loop. In spite of these adversities, the perception processor implementation achieves 64% of the performance of the Pentium at less than half its clock frequency. DSP processors typically implement a bit-reversed address space to improve the performance of FFT. The main reason for the reasonable FFT performance of the perception processor is that it uses its hardware support for vector indirect accesses to implement a bit-reversed addressing mode for this application. An indirection vector that corresponds to bit-reversed array indices is kept in, and used from, the scratch SRAM.
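The idea of a bit-reversed indirection vector can be illustrated in software. The following is a sketch only: the function names are hypothetical, the table is built in ordinary Python rather than in microcode, and a list gather stands in for the hardware vector indirect access.

```python
def bit_reversed_indices(n):
    """Return a table whose entry i is i with its log2(n) index bits reversed.

    n must be a power of two, as required by a radix-2 FFT.
    """
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a positive power of two")
    bits = n.bit_length() - 1
    table = []
    for i in range(n):
        reversed_index = 0
        for b in range(bits):
            if i & (1 << b):
                reversed_index |= 1 << (bits - 1 - b)
        table.append(reversed_index)
    return table

def gather(data, indirection):
    """Read data through an indirection vector (software stand-in for a
    vector indirect access)."""
    return [data[i] for i in indirection]

# Built once for a given FFT size and reused for every input packet.
table = bit_reversed_indices(8)
reordered = gather(["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7"], table)
```

Because the table depends only on the transform size, it can be computed once and kept resident, so the reordering costs one indirect read per element rather than a bit-manipulation loop.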
The other outlier is Fleshtone, the benchmark on which the perception processor performs best. Though this is a relatively simple algorithm, it involves numerous floating point operations. Since the number of operators far exceeds the number of function units available on the perception processor, the dataflow graph of the algorithm was split into several small subgraphs, and multiple passes were made over an input packet (a 320 pixel raster line) to fully evaluate the algorithm. Numerous temporary values are generated in the process, and these are stored in the SRAMs between successive passes. The Pentium version, on the other hand, fully evaluates the algorithm on each pixel before moving on to the next pixel in the input packet. The floating point register stack of the x86 architecture is too small to hold the number of temporary results created. This results in several unnecessary moves, exchanges, loads and stores of intermediate values. The main loop body generated by GCC contains over 80 instructions and takes more than 208 cycles on average per iteration. In the case of the perception processor, compiler controlled dataflow reduces the number of temporaries, and the SRAM permits storage of a very large number of intermediate results: over 1600 values in six passes. Ultimately, this leads to the perception processor outperforming the Pentium by a factor of 6.4.
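The difference between the two evaluation orders can be sketched in software. The arithmetic below is a made-up stand-in and is not the Fleshtone algorithm; the function names are hypothetical, and Python lists play the role of the SRAMs that hold temporaries between passes. The sketch uses three passes, whereas the implementation described above uses six.

```python
LINE_WIDTH = 320  # one input packet is a 320 pixel raster line

def per_pixel(r, g, b):
    """Pentium-style order: finish all work on one pixel before the next."""
    out = []
    for i in range(len(r)):
        s = r[i] + g[i] + b[i] + 1.0   # placeholder arithmetic, not Fleshtone
        rn = r[i] / s
        gn = g[i] / s
        out.append(rn - gn)
    return out

def multi_pass(r, g, b):
    """Multi-pass order: each pass runs one small subgraph over the whole
    line and stores its temporaries (lists standing in for SRAM)."""
    n = len(r)
    s = [r[i] + g[i] + b[i] + 1.0 for i in range(n)]   # pass 1
    rn = [r[i] / s[i] for i in range(n)]               # pass 2
    gn = [g[i] / s[i] for i in range(n)]
    return [rn[i] - gn[i] for i in range(n)]           # pass 3

# Synthetic input line (placeholder values).
r = [float(i % 256) for i in range(LINE_WIDTH)]
g = [float((2 * i) % 256) for i in range(LINE_WIDTH)]
b = [float((3 * i) % 256) for i in range(LINE_WIDTH)]
same = per_pixel(r, g, b) == multi_pass(r, g, b)
```

Both orders perform the same operations on the same operands; the multi-pass form trades a small register working set per pass for line-sized temporary storage, which is the trade the SRAMs make possible.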