This example contrasts the execution of an algorithm on Defoe and a super scalar processor (Intel Pentium). The C language function absdiff computes the sum of absolute difference of two arrays A and B which contain 256 elements each.
int sum, diff, i;
sum = 0;
for(i = 0; i<256; i++)
diff = A[i] - B[i];
if(A[i] >= B[i])
sum = sum + diff;
sum = sum - diff; }
Register assignment: On entry, r1 = a, r2 = b. On exit, sum is in r4.
1. add r3 = r1, 2040 // r3 = End of array A
2. add r4 = r0, r0 ; // sum = r4 = 0
3. ld r5 = [r1] // load A[i]
4. ld r6 = [r2] // load B[i]
5. add r1 = r1, 8 // Increment A address
6. add r2 = r2, 8 // Increment B address
7. cmp.neq pr1 = r1, r3 ; // pr1 = (i != 255)
8. sub r7, r5, r6 // diff = A[i] - B[i]
9. cmp.gte pr2 = r5, r6 ; // pr2 = (A[i] = B[i])
10. (pr2) add r4 = r4, r7 // if A[i] = B[i]
// sum = sum + diff
11. (!pr2) sub r4 = r4, r7 // else sum = sum - diff
12. (pr1) br.sptk .L1 ;
Stack assignment: On entry, 12(%ebp) = B, 8(%ebp) = A. On exit, sum is in the eax register.
1. movl 12(%ebp), %edi // edi = B
2. xorl %esi, %esi // esi sum = 0
3. xorl %ebx, %ebx // ebx = 0
4. movl 8(%ebp), %eax // eax = A
5. movl (%eax,%ebx,4), %edx // edx = A[i]
6. movl %edx, %ecx // ecx = A[i]
7. movl (%edi,%ebx,4), %eax // eax = B[i]
8. subl %eax, %ecx // ecx = diff = A[i] - B[i]
9. cmpl %eax, %edx // A[i] B[i]
10. jl .L7 // goto .L7 is A[i] B[i]
11. addl %ecx, %esi // sum = sum + diff
12. jmp .L5
13. subl %ecx, %esi // sum = sum - diff
14. incl %ebx // i++
15. cmpl $255, %ebx // i = 255 ?
16. jle .L6
17. popl %ebx
18. movl %esi, %eax
It is worth noticing that the Defoe listing contains only one branch (on line 12) whereas the x86 listing contains 3 branches. On a VLIW processor, we can often use predicated instructions to eliminate branches. In both listings, line 9 corresponds to the comparison of A[i] and B[i]. The Pentium version does a conditional jump based on the result of the comparison. On the other hand, the VLIW uses the result of the comparison to set a predicate. The predicate is then used to selectively write back the result of either the add or the subtract operation and the result of the other operation is discarded. This technique of converting a control dependence into a data dependence is called `if conversion'. The benefits go beyond the single cycle saved by not doing a jump as in the case of the super scalar processor. The jumps on line 10 and 12 in the second listing depend on the condition code which in turn depends on the data. Such data dependent branches are difficult to predict. Assuming that A[i] B[i] and A[i] = B[i] are equally likely, the super scalar processor is likely to experience a branch misprediction and the resulting branch penalty half of the time.
Going by the VLIW philosophy of conveying performance critical information from the compiler to the hardware, the final branch on line 12 uses the opcode modifier ``sptk'' to inform the processor that the branch is statically predicted to be taken. For that particular loop, a VLIW processor can therefore predict the loop accurately 255 times out of 256 loop iterations without any hardware branch predictor. Even when a hardware branch predictor is available, the instruction advises the processor not to waste a branch history table entry on that branch since its behavior is already known at compile time.