Abstract
Vector multiprocessors rely on both spatial and temporal parallelism to achieve significant speedup. For singly nested loops, we study the effect on speedup of (1) loop fusion and (2) increasing the granule size of parallel-vector loops using statements extracted from scalar loops. The proposed optimizations migrate vector statements from one loop to another, create new loops, and reduce others. Loops and statements that belong to strongly connected data paths are vertically fused, whenever possible, to promote chaining and register and cache reuse. To reduce loop synchronization, horizontal fusion is also applied to independent loops that have compatible dependence types. Finally, vector operations are scheduled based on knowledge of the timing of the arithmetic pipelines and of load and store operations, and on management of the available resources. Testing is carried out using synthetic Fortran programs on the Convex C240 vector multiprocessor. The proposed loop fusion improves speedup by 18 to 43% over the C240 commercial optimizing compiler. Chaining-oriented scheduling and allocation yield a 9 to 15% improvement over the highest optimization option of the C240 compiler.
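The two fusion forms named above can be illustrated at the source level with a minimal sketch. This is not the paper's compiler transformation. The function names `unfused` and `fused` and the sample loops are hypothetical, and Python is used only to show the rewrite. The performance benefits (chaining, register and cache reuse, fewer loop synchronizations) would only appear in compiled vector code. Vertical fusion merges a producer loop with its consumer so that each `a[i]` is used in the same iteration that computes it. Horizontal fusion merges two independent loops with the same trip count.

```python
# Hypothetical illustration of vertical and horizontal loop fusion.
# Only the source-level rewrite is shown; Python gains nothing in speed.

def unfused(b, c, e, y, w):
    n = len(b)
    a = [0.0] * n
    d = [0.0] * n
    # Loop 1 produces a; loop 2 consumes it (a producer-consumer pair).
    for i in range(n):
        a[i] = b[i] + c[i]
    for i in range(n):
        d[i] = a[i] * e[i]
    # Loops 3 and 4 are independent of each other and of loops 1-2.
    m = len(y)
    x = [0.0] * m
    z = [0.0] * m
    for i in range(m):
        x[i] = 2.0 * y[i]
    for i in range(m):
        z[i] = w[i] + 1.0
    return a, d, x, z


def fused(b, c, e, y, w):
    n = len(b)
    a = [0.0] * n
    d = [0.0] * n
    # Vertical fusion: a[i] is consumed in the iteration that produces it.
    # In vector code the add could chain into the multiply, and a would
    # not need a round trip through memory between the two loops.
    for i in range(n):
        a[i] = b[i] + c[i]
        d[i] = a[i] * e[i]
    m = len(y)
    x = [0.0] * m
    z = [0.0] * m
    # Horizontal fusion: two independent loops with equal trip counts share
    # one loop, so a parallel version would need one synchronization, not two.
    for i in range(m):
        x[i] = 2.0 * y[i]
        z[i] = w[i] + 1.0
    return a, d, x, z


if __name__ == "__main__":
    args = ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [2.0, 2.0, 2.0],
            [1.0, 3.0], [0.0, 5.0])
    # Fusion must not change the results, only the loop structure.
    assert fused(*args) == unfused(*args)
    print(fused(*args))
```

Vertical fusion is legal here because the only dependence between the first two loops runs forward within the same index `i`. A consumer that read `a[i + 1]` would need a dependence check before fusing.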
| Original language | English |
|---|---|
| Pages (from-to) | 56-64 |
| Number of pages | 9 |
| Journal | Journal of Parallel and Distributed Computing |
| Volume | 31 |
| Issue number | 1 |
| State | Published - 15 Nov 1995 |
ASJC Scopus subject areas
- Software
- Theoretical Computer Science
- Hardware and Architecture
- Computer Networks and Communications
- Artificial Intelligence