Consider a set of
kernels named
and
processors
. Let
denote the execution time (delay) of kernel
when executed
on processor (or processor set)
. When the set of processors
is understood from the context, we will denote this as
.
Let
denote the power set of
. A schedule
is a mapping
where
is the start
time of the kernel. We will here after use the notation
and
to denote the set of processors and start time assigned
to kernel
. Legal mappings obey the following rules:
To consider the relative benefits of time and space multiplexing we
present a simplified analysis that considers only two kernels
and
where
depends on
. A more detailed theoretical
framework has been developed by the authors, but the details will
be deferred to a later publication. Consider the case where the amount
of data parallelism available is much larger than the number of processors
. For the space multiplexed case, we follow the schedule:
and
=
.
For the time multiplexed case, we follow the schedule:
and
.
Because of abundant data parallelism, we can assume that execution
time of a kernel is inversely proportional to the number of processors
allocated to it. Then,
and
=
=
(Arbitrarily picking
as the larger term).
Therefore,
.
Since this ratio is greater than or equal to one, time multiplexing
works better than space multiplexing in this case. Intuitively, the
space multiplexed version works like a pipeline where the stage delays
are unbalanced and the pipeline shifts at the speed of the slowest
stage while the space multiplexed version is perfectly load balanced.
In addition, it is possible to convert the time multiplexed system
into an m-way SIMD version where all m copies share the same control
logic and instruction memory leading to lower area and higher power
efficiency. When the stages are balanced
.
Then the throughput ratio becomes one, but time multiplexing is still
better because of higher area and power efficiency.
The situation is quite different when the data/instruction level parallelism
is limited. Let kernels
and
each require
and
instructions worth of computation respectively. Assume
that all instructions require one cycle. Further, assume that dependences
limit the peak IPC possible to
and
where
.
Then,
and
. Then,
and
.
Therefore,
.
Since this quantity is less than one, space multiplexing is the better alternative in this case. Intuitively, time multiplexing lets execution resources go waste while space multiplexing shares the resources leading to better performance. As before, when the load is perfectly balanced they have equivalent throughput, but time multiplexing is the better option in that case. On a practical note, when time multiplexed architectures are based on very wide SIMD execution, it is often cumbersome to re-formulate algorithms to match the wide SIMD model. Programmers might find it much more convenient to express an algorithm as a pipeline of tasks in a form suitable for space multiplexing. Programming languages like StreamIt attempt to automatically compile space-multiplexed code to space-time multiplexed binaries, but further research in this area is required to take advantage of the efficiency of time multiplexed architectures.