next up previous contents
Next: 4 Stream Processor Implementations Up: Stream Processors and their Previous: 2 The Stream Virtual   Contents

3 Time and Space Multiplexing

Consider a set of $n$ kernels named $K=\{ K_{1},K_{2},...,K_{n}\}$ and $m$ processors $P=\{ P_{1},P_{2},...,P_{m}\}$. Let $D(K_{i},P_{j})$ denote the execution time (delay) of kernel $K_{i}$when executed on processor (or processor set) $P_{j}$. When the set of processors is understood from the context, we will denote this as $D(K_{i})$. Let $\mathcal{P}(P)$ denote the power set of $P$. A schedule $S$ is a mapping $S(K)\to\mathcal{P}(P)\times t$ where $t$ is the start time of the kernel. We will here after use the notation $K_{i}.p$ and $K_{i}.t$ to denote the set of processors and start time assigned to kernel $K_{i}$. Legal mappings obey the following rules:

  1. $K_{j}.t\ge K_{i}.t+D(K_{i})$ when ever $K_{j}$depends on $K_{i}$, i.e., dependences are not violated.
  2. For all $i,j$ such that $i\ne j$, and $K_{i}.t\le K_{j}.t<K_{i}.t+D(K_{i},K_{i}.p)$, it should be the case that $K_{i}.p\cap K_{j}.p=\phi$, i.e., only one kernel may be executing on a processor at any time.
  3. For all $i$, $K_{i}.p\ne\phi$, i.e., every kernel should be allocated at least one processor.
A stream processing system whose schedules follow the rule, for all $i$, $K_{i}.p=P$ is called a time multiplexed system. In such a system, all available processing resources are allocated to one kernel and then to the next kernel and so on. A stream processing system whose schedules follow the rule, for all $i$, $\vert K_{i}.p\vert=1$ is called a space multiplexed system. In such a system, only one processor is ever allocated to one kernel. If there is some $i,j$ such that $i\ne j$, and $K_{i}.t\le K_{j}.t<K_{i}.t+D(K_{i},K_{i}.p)$ and $\vert K_{i}.p\vert>1$, the system is said to be space-time multiplexed. In that case, multiple kernels execute simultaneously and some kernels are given more than one processor. Figure 3 shows three different mappings for the application from Figure 1 on to the architecture from Figure 2. By our definitions, Figure 3(A) is a space multiplexed schedule, Figure 3(B) is a time multiplexed schedule and Figure 3(C) is a space-time multiplexed schedule. To simplify the example, DMA transfers are not shown.
Figure 3: Example Schedules for the Application from Figure 1
\includegraphics[%
width=5in,
keepaspectratio]{figures/multi-spi.eps}

To consider the relative benefits of time and space multiplexing we present a simplified analysis that considers only two kernels $K_{1}$and $K_{2}$ where $K_{2}$depends on $K_{1}$. A more detailed theoretical framework has been developed by the authors, but the details will be deferred to a later publication. Consider the case where the amount of data parallelism available is much larger than the number of processors $m$. For the space multiplexed case, we follow the schedule:

$S(K_{1})=(0,\{ P_{1},P_{2},...,P_{m/2-1}\})$ and $S(K_{2})$= $(D(K_{1}),\{ P_{m/2},P_{m/2+1},...,P_{m}\})$.

For the time multiplexed case, we follow the schedule:

$S(K_{1})=(0,\{ P_{1},P_{2},...,P_{m}\})$ and $S(K_{2})=(D(K_{1}),\{ P_{1},P_{2},...,P_{m}\})$.

Because of abundant data parallelism, we can assume that execution time of a kernel is inversely proportional to the number of processors allocated to it. Then, $Throughput_{time\, mux}=\frac{1}{D(K_{1})+D(K_{2})}$ and $Throughput_{space\, mux}$= $\frac{1}{MAX(2\times D(K_{1}),2\times D(K_{2}))}$= $\frac{1}{2\times D(K_{1})}$ (Arbitrarily picking $K_{1}$ as the larger term).

Therefore, $Throughput\, ratio_{time\, mux/space\, mux}=\frac{MAX(2\times D(K_{1}),2\times D(K_{2}))}{D(K_{1})+D(K_{2})}$. Since this ratio is greater than or equal to one, time multiplexing works better than space multiplexing in this case. Intuitively, the space multiplexed version works like a pipeline where the stage delays are unbalanced and the pipeline shifts at the speed of the slowest stage while the space multiplexed version is perfectly load balanced. In addition, it is possible to convert the time multiplexed system into an m-way SIMD version where all m copies share the same control logic and instruction memory leading to lower area and higher power efficiency. When the stages are balanced $MAX(2\times D(K_{1}),2\times D(K_{2}))=2\times D(K_{1})=2\times D(K_{2})$. Then the throughput ratio becomes one, but time multiplexing is still better because of higher area and power efficiency.

The situation is quite different when the data/instruction level parallelism is limited. Let kernels $K_{1}$ and $K_{2}$ each require $N_{1}$ and $N_{2}$ instructions worth of computation respectively. Assume that all instructions require one cycle. Further, assume that dependences limit the peak IPC possible to $I_{1}$ and $I_{2}$ where $I_{1},I_{2}\le m/2$. Then, $D(K_{1})=N_{1}/I_{1}$ and $D(K_{2})=N_{2}/I_{2}$. Then, $Throughput_{time\, mux}=\frac{1}{\frac{N_{0}}{I_{0}}+\frac{N_{1}}{I_{1}}}$ and $Throughput_{space\, mux}=\frac{1}{MAX(\frac{N_{0}}{I_{0}},\frac{N_{1}}{I_{1}})}$. Therefore,
$Throughput\, ratio_{time\, mux/space\, mux}=\frac{MAX(\frac{N_{0}}{I_{0}},\frac{N_{1}}{I_{1}})}{\frac{N_{0}}{I_{0}}+\frac{N_{1}}{I_{1}}}$.

Since this quantity is less than one, space multiplexing is the better alternative in this case. Intuitively, time multiplexing lets execution resources go waste while space multiplexing shares the resources leading to better performance. As before, when the load is perfectly balanced they have equivalent throughput, but time multiplexing is the better option in that case. On a practical note, when time multiplexed architectures are based on very wide SIMD execution, it is often cumbersome to re-formulate algorithms to match the wide SIMD model. Programmers might find it much more convenient to express an algorithm as a pipeline of tasks in a form suitable for space multiplexing. Programming languages like StreamIt attempt to automatically compile space-multiplexed code to space-time multiplexed binaries, but further research in this area is required to take advantage of the efficiency of time multiplexed architectures.


next up previous contents
Next: 4 Stream Processor Implementations Up: Stream Processors and their Previous: 2 The Stream Virtual   Contents
Binu K. Mathew, Ali Ibrahim