10.3 Experimental Method
Hardware netlists for two different perception processor configurations were generated for this evaluation; they will henceforth be referred to as the integer cluster and the floating point cluster. The integer cluster consists of four ALUs and two multiply units, with the remaining two slots unused. The floating point cluster contains four ALUs and four FPUs. All of the integer benchmarks except FIR and Viola, which require integer multiply operations, would run equally well on the floating point cluster. The hardware for each configuration (the entire organization shown in Figure 9.1) was generated. The input and scratch SRAMs are sized at 8 KB each, and the output SRAM is 2 KB. The design is simulated at the transistor level using Spice while running the microcode for the benchmarks. The Spice simulation provides a supply current waveform with one sample per 100 picoseconds. This waveform, together with the supply voltage, is used to compute instantaneous power consumption; energy consumption is then obtained by numerical integration of power over time.
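The power and energy computation above amounts to scaling the sampled current by the supply voltage and integrating numerically. The following sketch illustrates the idea; the 1.2 V supply, the sample values, and the helper name `energy_from_current` are illustrative assumptions, not the actual simulation outputs.

```python
# Sketch of the energy calculation: instantaneous power is the supply
# voltage times the sampled supply current, and energy is the numerical
# (trapezoidal) integral of power over time. All values are illustrative.

DT = 100e-12   # 100 ps between Spice current samples
VDD = 1.2      # assumed supply voltage in volts

def energy_from_current(samples, vdd=VDD, dt=DT):
    """Trapezoidal integration of p(t) = vdd * i(t) over the run."""
    energy = 0.0
    for i_prev, i_next in zip(samples, samples[1:]):
        energy += 0.5 * (vdd * i_prev + vdd * i_next) * dt
    return energy

# A constant 10 mA draw across 10 samples spans 9 intervals of 100 ps:
# E = 1.2 V * 0.01 A * 9 * 100e-12 s, i.e. roughly 1.08e-11 J.
print(energy_from_current([0.010] * 10))
```

A real flow would stream the Spice waveform file instead of holding samples in a list, but the arithmetic is the same.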
The dual-ported SRAMs are macrocells generated by an SRAM generator tool, and simulating the entire SRAM array in Spice is not feasible. Instead, each SRAM read, write, and idle cycle was logged, and the normalized energy consumption was computed from the read, write, and idle currents reported by the SRAM generator. Each benchmark is run for several thousand cycles until the energy estimate converges.

This chapter assumes a framework similar to Figure 1.2, in which a host processor and memory controller combination transfers data into and out of the perception processor's local SRAMs. The perception processor operates only on data present in local SRAM and has no means of accessing main memory. To isolate main memory system power consumption and compare the merits of the processors fairly, both the perception processor and the general purpose processors are forced to repeatedly reuse data that has already been transferred into on-chip memory. The host processor is not simulated.
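The SRAM energy estimate described earlier reduces to a weighted sum of logged event counts. A minimal sketch follows; the per-operation currents, cycle time, and supply voltage are placeholders, not the figures reported by the actual SRAM generator.

```python
# Sketch of the SRAM energy model: each cycle is logged as a read,
# write, or idle event, and energy is estimated from average currents
# for each cycle type. All constants below are illustrative placeholders.

VDD = 1.2          # assumed supply voltage (V)
CYCLE_TIME = 1e-9  # assumed cycle time (s)

# Hypothetical average currents per cycle type (A)
I_READ, I_WRITE, I_IDLE = 5e-3, 6e-3, 0.5e-3

def sram_energy(reads, writes, idles):
    """E = VDD * t_cycle * (N_read*I_read + N_write*I_write + N_idle*I_idle)."""
    weighted_current = reads * I_READ + writes * I_WRITE + idles * I_IDLE
    return VDD * CYCLE_TIME * weighted_current

# e.g. 1000 reads, 500 writes, and 8500 idle cycles over a 10k-cycle run
print(sram_energy(1000, 500, 8500))
```

Running the benchmark for several thousand cycles corresponds to letting these counts grow until the per-cycle energy (total energy divided by cycle count) stabilizes.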
The function units are described in Verilog and the Synopsys module compiler language. The overall cluster organization and the interconnections between function units are generated automatically by the compiler. The whole design is then synthesized to the gate level and a clock tree is generated. The netlist is then annotated with heuristic worst-case RC wire loads assuming all routing happened on the lowest metal layer. Because exact measurements are extremely sensitive to wire routing decisions, wire capacitance calculations were based on the worst-case wiring layer, and the energy measurements are therefore likely to be pessimistic. The microcode corresponding to each benchmark is loaded into program memory and the Spice model is simulated in NanoSim, a commercial VLSI tool with Spice-like accuracy. The circuits were originally designed for a 0.25 µm CMOS process but were subsequently retargeted to a 0.13 µm process [22,23]. Only the 0.13 µm results are reported here.
The software version of each benchmark was compiled with the GNU GCC compiler at the -O3 optimization level and run on a 2.4 GHz Intel Pentium 4 processor. This system was modified at the board level to permit measuring the average current consumed by the processor module using a digital oscilloscope and a nonintrusive current probe. Several million iterations of each benchmark algorithm were run with the same input data to ensure that the input data always hits in the L1 cache. L2 cache and memory system effects are thereby isolated as much as possible, and the measurement represents core power. For the XScale system a similar approach is used, except that software control is used to turn off unnecessary activity and the difference between the quiescent state and the computation is measured. This method could slightly inflate the processor power, but measuring core power alone is not technically feasible on this system due to packaging constraints. The choice of both systems was based on the technical feasibility of PCB modifications to permit measuring energy consumption.
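The XScale differencing method amounts to subtracting the quiescent baseline from the active measurement. A minimal sketch, with made-up oscilloscope readings rather than real data:

```python
# Sketch of the differential measurement used for the XScale system:
# core power is approximated as active-state power minus the quiescent
# baseline measured with unnecessary activity turned off. The supply
# voltage, currents, and runtime below are illustrative, not real data.

VDD = 1.3            # assumed module supply voltage (V)
I_QUIESCENT = 0.050  # hypothetical average current while idle (A)
I_ACTIVE = 0.350     # hypothetical average current while computing (A)

def benchmark_power(i_active, i_quiescent, vdd=VDD):
    """Approximate benchmark power as the active/quiescent difference."""
    return vdd * (i_active - i_quiescent)

power = benchmark_power(I_ACTIVE, I_QUIESCENT)
runtime = 2.0e-3     # illustrative benchmark runtime (s)
print(power, power * runtime)   # power (W) and energy (J)
```

Because quiescent activity that cannot be disabled is attributed to the computation, this difference can slightly overstate core power, as noted above.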
Embedded processors like the XScale lack the floating point instructions required by some of the benchmarks. Software-emulated floating point would bloat the energy delay product of the XScale and make a meaningful comparison impossible. The comparison is therefore made against an ideal XScale whose FPUs have the same latency and energy consumption as an integer ALU. This is done by replacing each floating point operator in the code with the corresponding integer operator and running the resulting code on a real XScale processor. Henceforth, the name XScale refers to this idealized implementation. The results computed by the algorithm after this replacement are meaningless, but because floating point units typically incur several times the latency and power overheads of their integer counterparts, the measured performance and energy consumption represent a lower bound for any real XScale implementation with FPUs. This makes the XScale results look better than they really are.
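The operator substitution can be illustrated with a dot-product kernel of the kind found in FIR-style benchmarks; `fir_float` and `fir_idealized` are hypothetical names for this sketch, not functions from the actual benchmark suite.

```python
# Sketch of the XScale idealization: each floating point operator is
# replaced by the corresponding integer operator, so the operation count
# (and hence latency and energy on an integer ALU) is preserved while
# the numeric results become meaningless. The kernel is illustrative.

def fir_float(samples, coeffs):
    """Original floating point kernel: multiply-accumulate on floats."""
    return sum(s * c for s, c in zip(samples, coeffs))

def fir_idealized(samples, coeffs):
    """Idealized kernel: same multiplies and adds, but on integers.
    The result is numerically meaningless; only timing/energy matter."""
    return sum(int(s) * int(c) for s, c in zip(samples, coeffs))

print(fir_float([2.5, 3.5], [1.5, 2.5]))      # correct answer: 12.5
print(fir_idealized([2.5, 3.5], [1.5, 2.5]))  # meaningless answer: 8
```

The two kernels execute the same number of multiplies and adds, which is why the integer version's timing and energy lower-bound those of a real floating point implementation.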