This example contrasts the execution of an algorithm on Defoe and a super scalar processor (Intel Pentium). The C language function absdiff computes the sum of absolute difference of two arrays A and B which contain 256 elements each.
{
int sum, diff, i;
sum = 0;
for(i = 0; i<256; i++)
{
diff = A[i] - B[i];
if(A[i] >= B[i])
sum = sum + diff;
else
sum = sum - diff; }
return sum;
}
Register assignment: On entry, r1 = a, r2 = b. On exit, sum is in r4.
1. add r3 = r1, 2040 // r3 = End of array A
2. add r4 = r0, r0 ; // sum = r4 = 0
.L1:
3. ld r5 = [r1] // load A[i]
4. ld r6 = [r2] // load B[i]
5. add r1 = r1, 8 // Increment A address
6. add r2 = r2, 8 // Increment B address
7. cmp.neq pr1 = r1, r3 ; // pr1 = (i != 255)
8. sub r7, r5, r6 // diff = A[i] - B[i]
9. cmp.gte pr2 = r5, r6 ; // pr2 = (A[i]
= B[i])
10. (pr2) add r4 = r4, r7 // if A[i]
= B[i]
// sum = sum + diff
11. (!pr2) sub r4 = r4, r7 // else sum = sum - diff
12. (pr1) br.sptk .L1 ;
Stack assignment: On entry, 12(%ebp) = B, 8(%ebp) = A. On exit, sum is in the eax register.
1. movl 12(%ebp), %edi // edi = B
2. xorl %esi, %esi // esi sum = 0
3. xorl %ebx, %ebx // ebx = 0
.p2align 2
.L6:
4. movl 8(%ebp), %eax // eax = A
5. movl (%eax,%ebx,4), %edx // edx = A[i]
6. movl %edx, %ecx // ecx = A[i]
7. movl (%edi,%ebx,4), %eax // eax = B[i]
8. subl %eax, %ecx // ecx = diff = A[i] - B[i]
9. cmpl %eax, %edx // A[i]
B[i]
10. jl .L7 // goto .L7 is A[i]
B[i]
11. addl %ecx, %esi // sum = sum + diff
12. jmp .L5
.p2align 2
.L7:
13. subl %ecx, %esi // sum = sum - diff
.L5:
14. incl %ebx // i++
15. cmpl $255, %ebx // i
= 255 ?
16. jle .L6
17. popl %ebx
18. movl %esi, %eax
It is worth noticing that the Defoe listing contains only one branch
(on line 12) whereas the x86 listing contains 3 branches. On a VLIW
processor, we can often use predicated instructions to eliminate branches.
In both listings, line 9 corresponds to the comparison of A[i]
and B[i]. The Pentium version does a conditional jump based on
the result of the comparison. On the other hand, the VLIW uses the
result of the comparison to set a predicate. The predicate is then
used to selectively write back the result of either the add or the
subtract operation and the result of the other operation is discarded.
This technique of converting a control dependence into a data dependence
is called `if conversion'. The benefits go beyond the single cycle
saved by not doing a jump as in the case of the super scalar processor.
The jumps on line 10 and 12 in the second listing depend on the condition
code which in turn depends on the data. Such data dependent branches
are difficult to predict. Assuming that A[i]
B[i] and
A[i]
= B[i] are equally likely, the super scalar processor
is likely to experience a branch misprediction and the resulting branch
penalty half of the time.
Going by the VLIW philosophy of conveying performance critical information from the compiler to the hardware, the final branch on line 12 uses the opcode modifier ``sptk'' to inform the processor that the branch is statically predicted to be taken. For that particular loop, a VLIW processor can therefore predict the loop accurately 255 times out of 256 loop iterations without any hardware branch predictor. Even when a hardware branch predictor is available, the instruction advises the processor not to waste a branch history table entry on that branch since its behavior is already known at compile time.