Consider a set of kernels named and processors . Let denote the execution time (delay) of kernel when executed on processor (or processor set) . When the set of processors is understood from the context, we will denote this as . Let denote the power set of . A schedule is a mapping where is the start time of the kernel. We will here after use the notation and to denote the set of processors and start time assigned to kernel . Legal mappings obey the following rules:
To consider the relative benefits of time and space multiplexing we present a simplified analysis that considers only two kernels and where depends on . A more detailed theoretical framework has been developed by the authors, but the details will be deferred to a later publication. Consider the case where the amount of data parallelism available is much larger than the number of processors . For the space multiplexed case, we follow the schedule:
and = .
For the time multiplexed case, we follow the schedule:
Because of abundant data parallelism, we can assume that execution time of a kernel is inversely proportional to the number of processors allocated to it. Then, and = = (Arbitrarily picking as the larger term).
Therefore, . Since this ratio is greater than or equal to one, time multiplexing works better than space multiplexing in this case. Intuitively, the space multiplexed version works like a pipeline where the stage delays are unbalanced and the pipeline shifts at the speed of the slowest stage while the space multiplexed version is perfectly load balanced. In addition, it is possible to convert the time multiplexed system into an m-way SIMD version where all m copies share the same control logic and instruction memory leading to lower area and higher power efficiency. When the stages are balanced . Then the throughput ratio becomes one, but time multiplexing is still better because of higher area and power efficiency.
The situation is quite different when the data/instruction level parallelism
is limited. Let kernels and each require
and instructions worth of computation respectively. Assume
that all instructions require one cycle. Further, assume that dependences
limit the peak IPC possible to and where
Since this quantity is less than one, space multiplexing is the better alternative in this case. Intuitively, time multiplexing lets execution resources go waste while space multiplexing shares the resources leading to better performance. As before, when the load is perfectly balanced they have equivalent throughput, but time multiplexing is the better option in that case. On a practical note, when time multiplexed architectures are based on very wide SIMD execution, it is often cumbersome to re-formulate algorithms to match the wide SIMD model. Programmers might find it much more convenient to express an algorithm as a pipeline of tasks in a form suitable for space multiplexing. Programming languages like StreamIt attempt to automatically compile space-multiplexed code to space-time multiplexed binaries, but further research in this area is required to take advantage of the efficiency of time multiplexed architectures.