next up previous contents
Next: 5 The Intel Itanium Up: 4 Defoe: An Example Previous: 4.8 Example 1   Contents

4.9 Example 2

This example contrasts the execution of an algorithm on Defoe and a super scalar processor (Intel Pentium). The C language function absdiff computes the sum of absolute difference of two arrays A and B which contain 256 elements each.

int absdiff(int *A, int *B)

{

  int sum, diff, i;

  sum = 0;

  for(i = 0; i<256; i++)

    {

        diff = A[i] - B[i];

        if(A[i] >= B[i])

          sum = sum + diff;

        else

         sum = sum - diff;       }

  return sum;

A hand assembled version of absdiff in Defoe assembly language is shown below. For clarity, it has been left unoptimized. An optimizing compiler will unroll this loop and software pipeline it.

Register assignment: On entry, r1 = a, r2 = b. On exit, sum is in r4.

Line # Code Comment

1. add r3 = r1, 2040 // r3 = End of array A

2. add r4 = r0, r0 ; // sum = r4 = 0

.L1:

3. ld r5 = [r1]      // load A[i]

4. ld r6 = [r2]      // load B[i]

5. add r1 = r1, 8    // Increment A address

6. add r2 = r2, 8    // Increment B address

7. cmp.neq pr1 = r1, r3 ; // pr1 = (i != 255)

8. sub r7, r5, r6 // diff = A[i] - B[i]

9. cmp.gte pr2 = r5, r6 ; // pr2 = (A[i] $>$= B[i])

10. (pr2) add r4 = r4, r7 // if A[i] $>$= B[i]

// sum = sum + diff

11. (!pr2) sub r4 = r4, r7 // else sum = sum - diff

12. (pr1) br.sptk .L1 ;

The corresponding code for an Intel processor is shown below. This is a snippet of actual code generated by the GCC compiler.

Stack assignment: On entry, 12(%ebp) = B, 8(%ebp) = A. On exit, sum is in the eax register.

Line # Code Comment

1. movl 12(%ebp), %edi // edi = B

2. xorl %esi, %esi // esi sum = 0

3. xorl %ebx, %ebx // ebx = 0

.p2align 2

.L6:

4. movl 8(%ebp), %eax // eax = A

5. movl (%eax,%ebx,4), %edx // edx = A[i]

6. movl %edx, %ecx // ecx = A[i]

7. movl (%edi,%ebx,4), %eax // eax = B[i]

8. subl %eax, %ecx // ecx = diff = A[i] - B[i]

9. cmpl %eax, %edx // A[i] $<$ B[i]

10. jl .L7 // goto .L7 is A[i] $<$ B[i]

11. addl %ecx, %esi // sum = sum + diff

12. jmp .L5

.p2align 2

.L7:

13. subl %ecx, %esi // sum = sum - diff

.L5:

14. incl %ebx // i++

15. cmpl $255, %ebx // i $<$= 255 ?

16. jle .L6

17. popl %ebx

18. movl %esi, %eax

The level of parallelism available in the Defoe listing lines 3-7 (five issue) can be achieved on a super scalar processor only if the processor can successfully isolate the five independent operations fast enough to issue them all during the same cycle. Dependency checking in h/w is extremely complex and adds to the delay of super scalar processors. The x86 being a register deficient CISC architecture also incurs additional penalties because of register renaming and CISC to internal RISC format translation.

It is worth noticing that the Defoe listing contains only one branch (on line 12) whereas the x86 listing contains 3 branches. On a VLIW processor, we can often use predicated instructions to eliminate branches. In both listings, line 9 corresponds to the comparison of A[i] and B[i]. The Pentium version does a conditional jump based on the result of the comparison. On the other hand, the VLIW uses the result of the comparison to set a predicate. The predicate is then used to selectively write back the result of either the add or the subtract operation and the result of the other operation is discarded. This technique of converting a control dependence into a data dependence is called `if conversion'. The benefits go beyond the single cycle saved by not doing a jump as in the case of the super scalar processor. The jumps on line 10 and 12 in the second listing depend on the condition code which in turn depends on the data. Such data dependent branches are difficult to predict. Assuming that A[i] $<$ B[i] and A[i] $>$= B[i] are equally likely, the super scalar processor is likely to experience a branch misprediction and the resulting branch penalty half of the time.

Going by the VLIW philosophy of conveying performance critical information from the compiler to the hardware, the final branch on line 12 uses the opcode modifier ``sptk'' to inform the processor that the branch is statically predicted to be taken. For that particular loop, a VLIW processor can therefore predict the loop accurately 255 times out of 256 loop iterations without any hardware branch predictor. Even when a hardware branch predictor is available, the instruction advises the processor not to waste a branch history table entry on that branch since its behavior is already known at compile time.


next up previous contents
Next: 5 The Intel Itanium Up: 4 Defoe: An Example Previous: 4.8 Example 1   Contents
Binu K. Mathew