An Algebra for Deriving Ef?cient Implementations for an Array Processor Parallel Computer from Functional Speci?cations
Technical Report 1995/Jun.SF.AS.MC.JMB
Stephen Fitzpatrick1, A. Stewart, M. Clint, J. M. Boyle2y June 1995
yMathematics and Computer Science Division, The Queen’s University of Belfast, Department of Computer Science, Argonne National Laboratory, Belfast BT7 1NN, Argonne IL 60439, USA. Northern Ireland. S.Fitzpatrick, A.Stewart, M.Clint @cs.qub.ac.uk, firstname.lastname@example.org
Abstract We present a set of program transformations which are applied automatically to convert an abstract functional speci?cation of numerical algorithms into ef?cient implementations tailored to the AMT DAP array processor. The transformations are based upon a formal algebra of a functional array form, which provides a functional model of the array operations supported by the DAP programming language. The transformations are shown to be complete. We present speci?cations and derivations of two example algorithms: an algorithm for computing eigensystems and an algorithm for solving systems of linear equations. For the former, we compare the execution performance of the implementation derived by transformation with the performance of an independent, manually constructed implementation; the ef?ciency of the derived implementation matches that of the manually constructed implementation. Keywords: program transformation, program derivation, normal forms, functional speci?cation, AMT DAP array processor
The implementation of numerical mathematical algorithms on modern, high-performance computers presents an interesting contrast: most algorithms in this class have clear, easy-to-follow speci?cations, yet ef?cient implementations for high-performance computers are neither clear nor easy-to-follow. That numerical mathematical algorithms have transparent speci?cations is not surprising — their mathematical foundation provides a coherent, logical and systematic framework and a rich body of knowledge that may be used to construct their speci?cations. That acceptable implementations of numerical mathematical algorithms are rarely clear or easy-to-follow (or even correct!) is also not surprising — a programmer must usually formulate an implementation which
1 This work is supported by a research studentship from the Department of Education, Northern Ireland. 2 This work was supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Of?ce of
Computational and Technology Research, U.S. Department of Energy, under Contract W-31-109-Eng-38.
differs radically from the speci?cation in order to comply with the programming model supported by a particular implementation language and to exploit the speci?c hardware architecture in use (and thus improve the execution performance of the implementation). For example, a programmer may attempt to express certain operations in whole array or vector forms or to split a time-consuming task amongst a number of co-operating processes that execute concurrently on a parallel machine. Each implementation technique exacts a price as far as the clarity of the implemented program is concerned. When several implementation techniques are combined, the implementation becomes so complex that its relationship to the original algorithm speci?cation is not apparent. Ef?cient implementations are thus often dif?cult to construct, to verify, to maintain and to adapt for execution on other computer systems. In this paper, we discuss a method for automatically deriving ef?cient implementations from abstract speci?cations through program transformation: the transformational programmer constructs an abstract algorithm speci?cation in a clear, natural style, paying no heed to ef?ciency, and initiates the application of a sequence of meaning-preserving program transformations which implement the abstract constructs of the speci?cation language in some chosen target language (usually a Fortran or C dialect), eliminate inef?ciencies occasioned by the clear style of the speci?cation and tailor the implementation for the chosen computer system. The programmer’s task is thus changed from manually constructing a single implementation of a speci?cation, to developing systematic implementation methods, and to encoding these methods as program transformations. Most transformations are independent of the particular algorithm being implemented, so a programmer’s efforts will be be reused to produce implementations of other algorithms. In addition, many transformations are independent of the particular computer system for which an implementation is being derived, and can be reused for many computer systems. A single speci?cation may serve as the source from which multiple implementations are derived, each implementation being tailored for a particular computer system. The automated derivation of implementations for sequential and vector computer systems has been discussed previously ; in this paper, we extend this work to the derivation of implementations for the AMT DAP array processor. In Section 2 we discuss the speci?cation language we use, a functional programming language ML ; functional programming languages provide a convenient basis for expressing numerical mathematical algorithms [15, 36, 44]. We also discuss a small set of functions that support array operations in ML, and we illustrate the use of these functions by de?ning common matrix and vector operations. In Section 3 we specify two signi?cant algorithms that are used to solve problems frequently encountered in scienti?c and engineering applications (the computation of eigensystems and the solution of systems of linear equations). We then outline the target architecture in Section 4. In Section 5, we discuss the transformation system and the transformations used to produce DAP implementations. Example applications of the transformation are outlined in section 6 and an analysis of the execution performance of the derived implementations is given in section 7. A discussion of related work and conclusions are presented in sections 8 and 9.
A Functional Speci?cation Language for Numerical Mathematical Algorithms
We use a (small) subset of the language constructs of ML as an algorithm speci?cation language. We apply the term functional speci?cation to an ML de?nition to convey that the de?nition is intended as an abstract speci?cation of an algorithmic solution to a problem, not a concrete program to be executed in order to compute a solution ef?ciently. By regarding an ML de?nition as a speci?cation, we liberate its style from all demands for ef?cient execution. Speci?cations can then be written in a style and using those techniques that produce the greatest degree of clarity, the strongest guarantee of correctness, and the greatest degree of adaptability. The problem of creating an executable, ef?cient, concrete implementation by automated program transformation is addressed later.
Vector and Matrix Primitives
Algorithms in numerical linear algebra are conventionally expressed in terms of operations on vectors and matrices, which we support through a standard library of operations, based upon an array data abstraction. An array is de?ned as a mapping from a Shape to a set of values of a particular type:
array : Shape!
A Shape de?nes a set of indices (where an index is a list of integers specifying a position). We use the term Shape to emphasize that, in array operations, the set is usually regular; i.e. it can be speci?ed using a small number of parameters. In this paper, a Shape is de?ned by a number of dimensions with the details of each dimension expressed as a triple of the form:
lower upper step]
where lower is the smallest value in the set, upper is the largest value in the set and step is the offset between adjacent values. For example, a two dimensional 4 4 Shape may be de?ned as
1 4 1] 1 4 1]]
and denotes the set of indices For brevity, we use
n] to denote a dimension with unit lower bound and offset; for example: n n] 1 n 1] 1 n 1]] . 1 2] and 4 1] are
f i j]ji21::4^j21::4g .
The elements of a shape are indices, which are denoted as lists of values; for example, indices in the above 4 4 shape.
The library operations are de?ned in terms of three primitive functions for array element selection, array creation and array reduction. Element Selection Element selection is denoted using the function element; for example, element(A i) is (the value of) the element of array A at the position speci?ed by index i. For convenience, an operator notation
element : array index!
A@i element(A i)
is also used. Array Creation
generate : Shape (index! )! array An application of generate (called a generation) creates an array of the speci?ed shape having
elements given by applying the second argument (a function, called the generating function) to each index in the shape. Formally, generate is de?ned by:
v2S ) element(generate(S x E) v)
x E(v) E
x where x E denotes a function with formal argument x and with body E , and where Ev denotes the result of substituting v for all free occurrences of x in the expression E . For example,
element(generate(S i j] i + j) 1 2]) = (i + j)
= 1+2= 3
reduce : shape (index! ) (
A reduction combines a set of values into a cumulative value by means of a binary reducing function (the third argument to reduce). The set of values to be reduced is produced by applying a generating function (the second argument) to each index in a shape (the ?rst argument). The ?nal argument is the initial value — it is used to instantiate the reduction by inclusion in the set of values to be reduced (so that the application of a binary reducing function to the set is a valid operation even when the set contains only a single element). Formally, reduce can be de?ned by:
reduce(fg x E op a) = a reduce(S fyg x E op a) = reduce(S x E op a op E )
No order for applying the generating or reducing functions is speci?ed, so the reducing function should be associative and commutative. Common examples of reductions are summing of the elements of a matrix, conjoining the elements in a boolean matrix, and determining the maximum value in a matrix. The two functions size and shape are also used: size(A by n; shape(A) returns the shape of A.
n) returns the size of A in the dimension speci?ed
The primitive array functions have been designed to provide a convenient means for de?ning common mathematical operations and to avoid biasing functions in favour of any particular computer architecture. For example, an application of generate or reduce can be evaluated either sequentially or in parallel — no order is speci?ed for applying the generating function to the indices, or (in the case of reduce) for combining values. Algebraic laws for the array operations are presented in Section 5.3.1.
Speci?cations for Numerical Mathematical Algorithms
A library of standard matrix and vector operations is de?ned in terms of the primitive array operations. Most of the library operations are simple recastings of the conventional mathematical de?nitions. For example: Matrix Addition Mathematical de?nition: (A + B) i j] = A i j] + B i j] ML speci?cation:1 plus(A, B) = generate(shape(A), [i, j] A@[i, j]+B@[i, j]) Matrix Transpose Mathematical de?nition: AT i j] = A j i] ML speci?cation: transpose(A) = generate(Shape.transpose(shape(A)), [i, j] A@[j, i]) Vector Inner Product Mathematical de?nition: (U V ) = U 1] V 1] + : : : + U n] V n] ML speci?cation: inner˙product(U, V) = reduce(shape(U), [i] U@[i]*V@[i], +, 0.0) Matrix Multiplication Mathematical de?nition: (A B) i j] = (row(A i) column(B j)) ML speci?cation: multiply(A, B) = generate([size(A, 1), size(B, 2)], [i, j] inner˙product(row(A, i), column(B, j)))
1 In this paper,
-expressions are used to denote function expressions; ML uses the equivalent notation
In each case the ML de?nition and the conventional mathematical form are closely related. The de?nitions of commonly used functions such as plus, transpose, inner product, and multiply have been placed in a library of numerical mathematical functions which are used in speci?cations. In most cases, the functions are invoked using standard operator notation — the speci?cation language permits operators to be overloaded, so that, for example, ‘+’ denotes matrix addition as well as integer addition and real addition. The simplicity and clarity of functional programs make them particularly satisfactory for specifying numerical computations, especially when the basic speci?cation language is enhanced by the inclusion of data abstractions. Data abstractions make it possible to introduce concepts and notations that are suited to the problem domain of a speci?cation, or even to the particular problem under consideration.
In this section, we present speci?cations for two useful numerical mathematical algorithms as examples of more complex speci?cations.
An Algorithm for Computing Eigensystems — Parallel Orthogonal Transformations (POT)
The eigensystem (Q ) of a matrix A of order n satis?es the equation, AQ = Q , where is a diagonal matrix with the eigenvalues 1 : : : n of A as its diagonal elements, and the columns of Q are the corresponding eigenvectors. If A is symmetric, Q is guaranteed to be non-singular and is, in addition, orthogonal. POT  computes the eigensystem of a symmetric matrix by constructing a sequence of orthonormal matrices of eigenvector approximations, Uk , and a sequence of similar symmetric matrices, Bk , thus:
singular matrix argument using the Modi?ed Gram-Schmidt method. The columns of the argument matrix are orthogonalized in an order determined by the magnitude of the diagonal elements of Bk : the column corresponding to the diagonal element with greatest magnitude is orthogonalized ?rst; the column corresponding to the diagonal element with second greatest magnitude is orthogonalized second; and so on. POT may be realized by the following application of a function P ot (eigenvectors, eigenvalue˙matrix) = Pot(A, identity˙matrix(shape(A))) whose de?nition in Standard ML is fun Pot(A:real Array, U:real Array): real Array*real Array = let val B = transpose(U)*(A*U) in if (is˙satisfactory(B)) then (U, B) else Pot(A, ortho(A*U*transform(B), diagonal(B)) end; 5
U =I 2. U = ortho(AU transform(B ) diagonal(B )) k 0 3. B = A 4. B = U AU Then lim !1 fB g = and lim !1 fU g = Q. The function transform is de?ned below. The function ortho orthonormalizes the columns of its non1.
where let . . . in . . . end de?nes a local expression: the identi?er B is bound to the speci?ed value (transpose. . . ) during evaluation of the conditional expression; the value of the conditional expression is the value of the whole local expression.
T This de?nition follows directly from the description of POT given above: if B k = Uk AUk is suf?ciently close to being diagonal (as determined by the function is˙satisfactory) then U k is the matrix of eigenvectors of Q and the diagonal elements of Bk are the required eigenvalues; otherwise a more accurate approximation to Q is derived and Pot is re-applied with this new approximation as its second argument.
The operation transform produces from its matrix argument a matrix Tk which, ignoring its diagonal, is anti-symmetric and each column of which is an approximation to an eigenvector of Bk . The components of Tk are computed as shown in Figure 1(a). The ML speci?cation shown in Figure 1(b) de?nes a function transform that realizes the transform operation (the operator ? denotes negation). This speci?cation uses generate to construct the transformation matrix, Tk , which has the same shape as Bk . A function Calculate computes the value of the (i j)th element of the transformation matrix. The generating function embodies the cases required by the speci?cation. A similar development yields a speci?cation for the function ortho.
where dij = bjj bii , and where bij is a typical element of Bk
8 > < t = >1 :
dij +sign(dij )
i>j i=j i<j
fun transform(B:real Array):real Array = let fun Calculate(i:int, j:int):real = let val d = B@[j, j]-B@[i, i] in 2*B@[i, j]/(d+sign(d)*sqrt(sqr(d)+4*sqr(B@[i, j]))) in generate(shape(B), [i, j] if (i?j) then Calculate(i, j) else if (i=j) then 1.0 else ?Calculate(j, i)) end
(a) Mathematical de?nition 
(b) ML speci?cation
Figure 1: The transform operation
A Conjugate Gradient Algorithm
The Conjugate Gradient algorithm uses an iterative process to compute (an approximation to) the vector x of order n satisfying the equation Ax = b where A is a positive-de?nite, symmetric matrix of order n n and b is a vector of order n. To solve Ax = b, where A is a positive-de?nite symmetric n Set an initial approximation vector x 0 , calculate the initial residual r0 = b Ax0 , set the initial search direction p0 = r0; then, for i = 0 1 : : :, (a) calculate the coef?cient i = pT ri=pT Api , i i (b) set the new estimate xi+1 = xi + i pi , (c) evaluate the new residual ri+1 = ri i Api , (d) calculate the coef?cient i = ri+1Api =pT Api , i (e) determine the new direction pi+1 = ri+1 + i pi , continue until either r i or pi is zero.
Figure 2: Mathematical de?nition of Conjugate Gradient (J.J. Modi , p152) The name ‘Conjugate Gradient’ often refers to a class of algorithms which employ the basic method de?ned in Figure 2, rather than to a speci?c algorithm. The particular version used here is known as a bi-conjugate 6
gradient algorithm; the functional speci?cation is shown in Figure 3. 2 type cgstate = real vector*real vector*real vector*real vector; fun cgiters(a:real matrix, b:real vector):cgstate = let (* Terminating condition.*) fun has˙converged((x, r, p, q):cgstate):bool = inner˙product(r, r)?epsilon; (* One iteration.*) fun cgiter((x, r, p, q):cgstate):cgstate = let val rr:real = innerproduct(r, r); val alpha:real = rr/innerproduct(q, q); val x0:real vector = x+p*alpha; val atq:real vector = transpose(a)*q; val r0:real vector = r-atq*alpha; val beta:real = innerproduct(r0, r0 )/rr; val p0 :real vector = r0 +p*beta; val q0 :real vector = a*r0 +q*beta in cgstate(x0, r0, p0 , q0 ) end in iterate(cgiter, cgstate(x0, r0, p0, q0), has˙converged) end Figure 3: SML speci?cation of conjugate gradient The algorithm is based upon manipulation of a collection of vectors x, r, p and q (x being the current approximation to the solution); the type cgstate is de?ned to represent this collection of vectors, as a 4-tuple of real vectors. Instances of the cgstate type are constructed using the function cgstate. The function cgiters takes A and b as arguments and returns a cgstate whose ?rst component is the solution. The speci?cation uses the rithm.
iterate library function to perform the repetition required by the algo-
– The ?rst argument to iterate is a function cgiter de?ning the computation that is to be repeated. – The second argument is a value (an instance of cgstate) with which to initialize the process. – The third argument, has˙converged, is a function which determines when the repetition is to cease (i.e. when the approximation to the solution is suf?ciently accurate). The function de?ning the repeated computation, cgiter, takes a single argument of type cgstate and returns a value of the same type. In the speci?cation, pattern matching is used to bind the names x, r, p and q to the four components of the cgstate argument.
The body of cgiter computes the next collection of vectors as local values x0 , r0 , p0 and q0 and returns these values as an instance of cgstate. For brevity, the computation of the initial values x0, r0, p0 and q0 is not shown. The bulk of the computational costs are incurred by the two matrix-vector products in the computation of atq and q0 .
2 We emphasize that we are not interested in the merits and demerits of this particular speci?cation — it is merely one that was convenient for us to use as an example.
The functional speci?cations presented above are straightforward recastings of the mathematical de?nitions into the chosen speci?cation language. Although some of the syntactic detail differs from the mathematical form, the basic structures of the speci?cations mirror those of the mathematical de?nitions. The speci?cations should be readily understood by a reader with a knowledge of basic mathematics.
The Target Architecture: The AMT DAP 510
The AMT DAP 510  is a Single Instruction Multiple Datastream (SIMD) parallel computer system, consisting of a 32 by 32 grid of processing elements (see Figure 5) controlled by a separate master processor. The master processor — essentially a conventional 32 bit processor with some additional components for controlling the operations of the processing elements — performs most scalar calculations. The processing elements, which are single bit processors, perform the parallel processing operations. The master processor issues instructions to the processing elements, all of which obey the instruction simultaneously. The master processor may also issue data to the processing elements.
Processing element (i,j)
Memory element (i,j) in memory plane k
Figure 4: DAP memory planes
Figure 5: DAP processor grid
Each processing element has its own local memory to which it has direct access; no processing element has direct access to the memory of any other processing element. In general, in a given operation, all processing elements access the same component of their respective memories. Thus, the memory of all the processing elements may be thought of as consisting of a sequence of planes, the kth plane being the aggregate of the kth component of each processor’s memory; the processor grid may be thought of as performing operations on these memory planes (see Figure 4). When a processing element requires a value which is stored in the memory of another element, it must obtain the value by a communication mechanism. Each processing element is connected to its four nearest neighbours in the grid, an element on an edge being connected to the corresponding element on the opposite edge (directions on the grid are designated as north, south, east and west — see Figure 5); all of the processing elements can simultaneously obtain a value from one neighbour, though the direction in which each neighbour lies is the same across the entire grid. In addition to the nearest neighbour connections, the DAP hardware supports three broadcast mechanisms which can be used to duplicate values across the grid: a single scalar value can be broadcast to each processing element, or a set of 32 scalar values (called a vector) can be broadcast to each row or to each column of the grid. Associated with each processing element is an activity register which controls whether or not the element participates in certain operations. The activity mask (that is, the grid of 32 by 32 activity registers) can be set under program control and can thus be used to implement conditional operations. The DAP hardware also supports reduction operations (such as summation and conjunction) over the entire processor grid, and along only the rows or columns (to produce a vector of values).
The Target Language: Fortran Plus Enhanced
Fortran Plus Enhanced (FPE ) is an extension of standard Fortran allowing the processor grid of the AMT DAP to be used ef?ciently. It supports two non-conventional types, scalar vector and scalar matrix, which are similar to one dimensional and two dimensional arrays, but which have associated functions that make use of the processor grid. The size of vectors or matrices which may be used is limited only by the the amount of memory available, not by the size of the processor grid. Fortran Plus Enhanced subdivides a vector or matrix whose dimensions are larger than those of the processor grid into segments each of which is the size of the processor grid (if necessary, it pads the edges of the vector or matrix to make the size a multiple of the processor grid size). The features of Fortran Plus Enhanced that are important in the context of this paper are: Componental functions — scalar functions applied either to each element of a vector or matrix or to corresponding elements of a pair of vectors or matrices. The componental functions include common arithmetic and logical operations. e.g. A + B , for vectors and matrices A and B . Aggregate functions — certain elementwise reductions on a vector or matrix. e.g. sum(A), for a vector or matrix A. Vector or matrix assignment — simultaneous assignment of all elements of a vector or matrix. e.g.
A = B , for vectors or matrices A and B .
Masked assignment — vector or matrix assignment controlled by a mask (a boolean vector or matrix). Masked assignment affects only those elements of the left side vector or matrix for which the corresponding element of the mask is true. e.g. A(mask) = 1, which assigns 1 to matrix A where the mask mask is true. Masked vector or matrix assignment is the primary mechanism supporting conditional execution on the DAP processor array. Pattern functions — construction of vector or matrix masks having true elements arranged in certain commonly used patterns. For example: patunitdiag(N) is an N N matrix with true along its leading diagonal and false everywhere else; patlowertri(N) is an N N matrix with true in its lower triangle (the area on and below the leading diagonal) and false everywhere else. Geometric functions — functions to re-arrange the order of elements in a vector or matrix. e.g. transpose(A), for matrix A. Extractions — a vector with elements equal to the elements of a given row or column of a matrix. e.g. A(1 ) is row 1 of matrix A. Complex extraction functions — extractions performed using a mask as an index. For example, if M is a boolean matrix with one and only one element true in each row, then the positions of the true elements can be used to extract a vector from a matrix A, where A has same size as a column of M . For example, patunitdiag(n) is a boolean matrix with true values along the main diagonal; A(patunitdiag(n) ) is a vector comprising the diagonal elements of A.
Expansion functions — a vector or matrix having each element equal to a given scalar value, or a matrix having each row or each column equal to a given vector. e.g. mat(1:0 m n) is an m n matrix with each element having the value 1.0, and matr(V m) is a matrix having m rows each of which is a copy of the vector V . Shifts — vectors or matrices with all elements translated in the same direction. For example, a north shift moves all the elements of a matrix to the north, introducing null values along the south edge. To run ef?ciently on the DAP, a program must be expressed almost entirely in terms of the operations described; operations which cannot be expressed as combinations of these operations are executed on the scalar processor, resulting in much slower execution than is achievable on the processor array.
Transforming Functional Speci?cations to Ef?cient Programs
The TAMPR program transformation system [10, 14] can be employed to apply program transformations to derive ef?cient Fortran or C programs from higher-order functional speci?cations. Each TAMPR transformation rule is a rewrite rule, having a pattern and a replacement, both of which are speci?ed in terms of the grammar of the programming language. Most of the transformations that carry out such a derivation are independent of the problem being solved and of the target hardware, and so can be employed in derivations for any problem domain and for any target hardware architecture. As we discuss, however, one can easily add a few problem-domain-speci?c or hardware-speci?c transformations to the derivation to produce highly ef?cient code. Typically, a derivation is structured into a sequence of major stages, each of which consists of a short sequence of transformation sets. TAMPR applies each transformation set once in turn, but exhaustively applies all transformations comprising that set. The total number of transformation applications may be large: for the POT speci?cation, for example, the entire derivation from ML to Fortran Plus Enhanced requires about 10000 rewrites. Clearly, it is vital that TAMPR supports the automatic application of the rules. It would be unrealistic to attempt to apply thousands of transformations by hand, or even to guide their application.
Sketch of the Basic Transformational Derivation
The stages in the basic transformational derivation are depicted in Figure 6, in which the boxes represent particular transformation sequences and the arcs represent the order in which particular stages may be combined. The starting point for the derivation is a pure, functional speci?cation (expressed in Lisp or ML); the result of the derivation is an imperative implementation expressed either in Fortran 77 or ANSI C. The speci?cation is transformed by: 1. converting the speci?cation into the abstract functional language used by the transformation system (essentially, the -calculus extended with named functions and type information); 2. standardizing the abstract functional language to facilitate later processing; 3. simplifying structure of the functional speci?cation by unfolding function de?nitions and evaluating certain resulting expressions; 4. converting the abstract functional form to an equivalent abstract imperative form; and 5. converting the abstract imperative language to the required implementation language.
Lisp Specification 1
ML Specifciation ML Conversion
Functional Language Canonicalization
Functional Language Canonicalization
Unfolding and Simplification
Array Algebra Optimizations
Unfolding and Simplification
Functional form Preparation
Functional form to Imperative form.
Tail Recusion elimination
Array Operation Generation
Functional to Imperative Mapping
Common Subexpression Elimination
Array Operations to DAP Operations
Fortran Standardization C Standardization
Functional form to Imperative form
Fortran Plus Enhanced
Figure 6: Basic Transformation Derivation Figure 7: AMT DAP Transformation Derivation
By removing the ‘syntactic sugar’ of the initial speci?cation (written in ML or in another functional language) the derivation is freed from the syntactic details of the functional language used for speci?cation and thus permits other speci?cation languages to be used with little additional effort. Unfolding function de?nitions ensures that the only (non-recursive) functions persisting in a speci?cation belong to a small set of designated ‘primitive’ functions, such as generate and reduce. When de?ning the semantics of speci?cations or when transforming speci?cations, only the primitive functions need be considered after unfolding has been performed. The Unfolding and Simpli?cation stage (stage 3) may be omitted. The conversion of an abstract functional speci?cation into an equivalent abstract imperative form (step 4) is achieved by: 4.1. manipulating the functional speci?cation into a form that renders the conversion to imperative form a straightforward task; 4.2. performing tail recursion elimination on the abstract functional form; 4.3. mapping the language constructs of the abstract functional language onto equivalent constructs in the abstract imperative language (for example, conditional expressions are mapped on to conditional statements); and 4.4. implementing recursive functions by introducing a stack to store function arguments, return values and return addresses (thus removing the requirement on the implementation language to support recursive functions). The tail recursion elimination and stack implementation phases may be omitted. The transformations in the basic derivation provide the framework upon which other specialized derivations may be constructed. A more detailed discussion of the basic transformation steps, including some example code fragments generated at various stages, is given in [12, 15].
Transformational Derivation for the AMT DAP
For ef?cient execution on the AMT DAP, a speci?cation is recast into Fortran Plus Enhanced in order to exploit the parallel array operations provided on the processor grid. Rather than convert directly into FPE, the conversion is performed in two stages (see Figure 7): In the ?rst stage, array operations expressed in single-element terms are converted into whole-array operations. These whole-array operations are similar to those supported by Fortran Plus Enhanced, but they are all denoted as pure functions, whereas some of the FPE operations, such as masked array assignment, are destructive operations. The output form generated by this stage is called the Array Form. In the second stage, which augments the standard functional-to-imperative stage, the Array Form operations are converted into FPE operations. Although it is based on operations supported by FPE, the Array Form is not intended to be DAP speci?c — the operations it supports are generic array-processor operations. The Array Form could thus be used as an intermediate form for other array processors, or for other implementation languages that are based on whole-array operations (such as Fortran90 or High Performance Fortran). Moreover, because the Array Form is a pure, functional form, it is retains a simpler semantics than FPE, facilitating further manipulation such as Common Sub-expression Elimination. In addition to the two stages described above, a stage that uses algebraic properties of vectors and matrices to optimize speci?cations is included. For example, in the speci?cation of POT, the expressions U T AU and AUtransform(B) are evaluated. The matrix algebra stage ensures that the matrix product AU is computed only once, by rewriting these expressions as U T (AU) and (AU)transform(B). This optimization
is obvious, and may seem trivial, but it has considerable effect on the execution performance (since the matrix product operation is so computationally expensive).
Converting to Array Form
In this section we emphasize, in the main, the conversion from single-element to whole-array form; the other stages of the transformation to FPE, though important, are not very interesting. The Array Form is based upon the -calculus augmented with a set of functions that perform generic arrayprocessor operations. The additional functions correspond to the FPE operations discussed in Section 4.1. For example, the following operations are available: an operator +array for the elementwise addition of two arrays (usually the ‘array’ subscript is dropped in the discussion); a function row for extracting a speci?ed row of a matrix; a function sum for summing the elements of a numeric array. In addition, a data-parallel conditional expression, de?ned below, is used:
M then T else F = generate(S i if M@i then T @i else F@i)
where M , T and F are arrays of shape S . The data-parallel conditional constructs an array by merging the elements of two arrays (T and F ): a particular element of the result is drawn from T if the corresponding element of M (the ‘mask’) is true; otherwise the element is drawn from F . The purpose of the Array Form stage of the derivation is to convert array operations expressed using generate and reduce into Array Form operations. For example,
generate( n m] i j] A@ i j] + B@ 1 j])!A + expand rows(n row(B 1))
where expand rows(n V ) denotes a matrix having n rows, each of which is equal to the vector V . The advantage of the second, whole-array form is that it is easy to implement directly on an array processor. To implement directly and ef?ciently the ?rst, single-element form on an array processor would be dif?cult. The strategy that is used in the conversion to Array Form is to simplify the internal structure of applications of generate by propagating generate inwards through arithmetic and other operations contained in generating functions. For example,
generate( n m] i j] A@ i j] + B@ 1 j]) !generate( n m] i j] A@ i j]) + generate( n m] i j] B@ 1 j]) !A + expand rows(n generate( m] j] B@ 1 j])) !A + expand rows(n row(B 1)) Each step of the transformation is based upon algebraic properties of generate and reduce, which
array array array
are discussed below. Propagation through operators converts single-element operations into array operations. Propagation results in expressions such as
generate( n m] i j] A@ i j])
generate( m] j] B@ 1 j])
for which further propagation is impossible. The generating functions are assessed to determine whether or not they correspond to particular forms (such as ‘identity generate’ or ‘row extraction’), which can be implemented ef?ciently on an array processor. Establishing such correspondences is
facilitated by the simpli?ed structure of the transformed generating functions (as compared with the structure of the original generating function). 3 Below, the transformations that convert functional speci?cations into Array Form are discussed. The strategy used in applying these transformations (‘propagation of generate’) is described. Formal proofs that application of the transformations terminates under this strategy, and that application is complete, are given. 5.3.1 Algebraic Identities for generate and reduce
The transformations that convert single-element form into Array Form are based upon algebraic identities for generate and reduce. These identities are listed here in three categories: Propagation rules — which propagate applications of generate into expressions (thereby, for example, converting operators into whole-array form).
Special forms — which convert particular forms of generate into array operations (thereby, for example, extracting a row of a matrix). Optimizations — which enhance the degree of parallelism in expressions (thereby, for example, converting multiple vector operations into a single matrix operation). Propagation Rules 1. In?x Element Operator to In?x Array Operator
generate(S x E bop
E ) generate(S x E ) bop
generate(S x E )
where bop is a binary in?x operator.
A generation constructed from the expression E1 bop E2 is equivalent to the array version of applied to arrays constructed from E1 and E2. For example,
generate(S x E +
E ) generate(S x E ) +
generate(S x E )
2. Unary Element Operator to Unary Array Operator
generate(S x uopE) uop
where uop is a unary operator.
generate(S x E) uop to the array
An array constructed from uopE is equivalent to the elementwise application of constructed from expression E . For example,
generate(S x absE)
3. Promotion from generate
generate(S x E)
y Z x]
generate(S x (( y E1)E2)) (( Z generate(S x E1
))generate(S x E2))
Consider the left side of this identity: an array is constructed in which each element requires the evaluation of expression E2 and the binding of the result to identi?er y. There is no mechanism in
3 Of course, not every residual generating function produced by propagation will correspond to an array processor operation. In such circumstances, ef?cient implementation on an array processor may not be possible.
FPE for constructing such an array in parallel; the construction would have to be implemented in FPE as a sequential loop. However, the binding of E2 to y can, potentially, be performed for all elements in parallel by constructing a separate array, Z , as shown on the right of the identity — the value of E2 for each index x is stored as an element of Z . The original array is constructed as before except that the binding for y is replaced with an access to the appropriate element of Z . It may then be possible to construct both arrays in parallel. For example,
generate(S x ( y (y + sqrt(y))(2 A x]))) ( Z generate(S x (y + sqrt(y)) ))(generate(S x 2 A x])) ( Z generate(S x Z x] + sqrt(Z x])))(generate(S x 2 A x])) ( Z Z + sqrt(Z))(2 A) .
y Z x]
4. Conditional Expressions
generate(S x if E then E else E ) if generate(S x E ) then generate(S x E ) else generate(S x E )
This identity is essentially the de?nition of the data-parallel conditional. Special Forms Array processor programming languages generally include a prede?ned set of optimized methods for performing certain operations — primarily communication operations — commonly required for numerical mathematical algorithms. To produce an ef?cient program, these standard optimizations must be exploited. Thus, it is necessary to identify, from the array expressions within a speci?cation, those expressions that are instances of supported operations. Identifying such expressions is facilitated by the simpli?cation of generating functions that results from the propagation of generate carried out by the preceding set of transformations. 5. Array Identity
generate(S x A x]) A
where Shape(A) = S .
6. Array Constants
generate(S x e) expand(S e) where e is independent of the generating index x and expand(S e) is an array of shape S with each of its element having the value e. For example, generate( n] x 1:0) expand( n] 1:0)
is a vector of length n with each element having the value 1:0.
7. Column or Row Expansion
generate( n m] i j] E) expand cols( m] generate( n] i] E))
where E is independent of j ,
generate( n m] i j] E) expand rows( n] generate( m] j] E))
where E is independent of i. Constructing a matrix by applying a generating function that is independent of one of the indices is equivalent to constructing a vector and duplicating the vector row- or column-wise, as appropriate.
8. Array Patterns
generate( n n] i j] i = j) generate( n n] i j] i) = generate( n n] i j] j) diagonal pattern(n)
where diagonal pattern(n) is an n n mask with each of its diagonal elements having the value true, and all of its other elements having the value false. (Identities exist for other patterns, corresponding to other relations such as <.) The propagation rules apply to the ?rst term in this identity, to give the second term, because the equality operator = has an elementwise equivalent. 9. Permutations
generate( n m] i j] A j i]) transpose(A)
Transpose is the most common permutation. 10. Extractions
where shape(A) =
generate( n] i] A i i]) diagonal(A) generate( n] i] A i k]) column(A k) generate( m] i] A k i]) row(A k) where, in each case, k is independent of i.
where shape(A) = where shape(A) = where shape(A) =
n n] n m] n m]
generate( n m] i j] if i = 1 then 0 else A i ; 1 j]) Ifarray generate( n m] i j] i = 1) then generate( n m] i j] 0) else generate( n m] i j] A i ; 1 j]) ShiftSouth(A) where shape(A) = n m].
An expression of the ?rst form is converted into an expression of the second form by the propagation rules and is then converted into an application of ShiftSouth. Similar rules apply for shifts in other directions, for combinations of shifts (such as a north-east shift) and for unidirectional shifts of magnitude greater than 1. Optimizations To obtain optimum performance from a DAP implementation, it is necessary to augment the preceding rules with others that are designed to take advantage of the particular capabilities of an array processor architecture and, in particular, those of the AMT DAP. For an array processor architecture, it is preferable that a single large data-parallel operation be performed rather than a sequence of smaller data-parallel operations — this means that it is worthwhile to seek to combine or reorder sequences of generate and reduce operations to give a data-parallel operation that applies to the largest possible number of array elements. 12.
where S0 is independent of x, element of operator .
reduce(S x] reduce(S 0 y] E
0 ) reduce(S S 0 x y] E 0
is an identity
denotes the cartesian product of shapes, and
This identity asserts that a reduction, using an operator , of a set of values each of which is itself the result of a reduction using , is equivalent to a single reduction. For example,
reduce( n] x] reduce( m] y] A x y] + 0) + 0) reduce( n m] x y] A x y] + 0)
This optimization establishes a larger parallel reduction from a number of smaller reductions — by converting, in this example, n + 1 vector reductions into a single matrix reduction. This rule can be generalized to reductions in which the initial value is not the identity element of the reducing function. 13.
generate-reduce Swap generate(S
x y] reduce(S
x y] E)
where S2 , v are independent of x, and where array of initial values, respectively.
and v on the right side are an array operator and an
This identity asserts that the evaluation of multiple reductions, each of which produces a single element of a matrix, is equivalent to a single reduction which constructs the complete matrix, using the array version of the reducing function. This optimization is important in the context of the matrix product operation:
generate( n m] x y] reduce( l] z] A@ x z] B@ z y] + 0)) reduce( l] z] generate( n m] x y] A@ x z] B@ z y]) + 0) The left side corresponds to the ijk order of evaluation (with k parallelised); the right side corresponds to the kij order of evaluation (with ij parallelised). As discussed in , the latter order of evaluation can be understood as computing the matrix product by a sequence of n rank-one updates
to the zero matrix. The motivation for this optimization is as follows: the reduce-generate combination can be considered as exhibiting three-dimensional parallelism (the expression E must be evaluated for each combination of x, y and z , in the appropriate ranges). However, the DAP can utilize at most twodimensional parallelism, so that at least one dimension must be processed sequentially. Because reductions tend to exploit parallelism less than other operations (such as elementwise operations)4, it is an optimization to use this identity to arrange for the reduction to be the operation that is performed sequentially. 14.
generate-reduce Combination generate( m] x] reduce( n] y] E v)) reduce rows(generate( m n] x y] E) v)
where n and v are independent of x and reduce form a vector of values.
rows reduces its matrix argument along its rows to
This identity asserts that the evaluation of multiple reductions, each of which creates a single element of a vector, is equivalent to the construction of a matrix followed by a reduction along its rows. 5 This optimization improves performance by increasing the degree of parallelism — if the generate and reduce were not combined, the generate would be evaluated sequentially. For example, matrixvector product is optimized as:
generate( m] x] reduce( n] y] A@ x y] U@ y] + 0)) reduce rows(generate( m n] x y] A@ x y] U@ y]) + 0)
4 The parallel reduction of an vector of length n by an array processor typically requires log (n) steps, whereas the parallel 2 addition of two vectors of length n can be performed in one step. 5 A column-wise combination is more ef?cient than a row-wise combination for some expressions. The transformations used in practice include heuristics to decide which to use.
generate(S x generate(S 0 y E)) generate(S S 0 x y E)
This identity asserts that an array of arrays is considered equivalent to a single, ‘?attened’ array. This equivalence is included for completeness; it is not used in practice since it requires a more complex interpretation of basic operations. For example, array indexing must be ‘curried’ so that, for example, a two-dimensional index applied to a four-dimensional array returns a two-dimensional array. 5.3.2 Transformation Application Strategy
The equivalences in Section 5.3.1, when used left-to-right, constitute the transformations required when converting an abstract functional speci?cation into an ef?cient form suitable for execution on an array processor architecture and, in particular, on the AMT DAP. The rules involve patterns that are disjoint; thus, no transformation interferes with any other transformation (i.e., for a given program section, at most one transformation is applicable). In addition, the rules cannot result in an in?nite sequence of transformations (see next section). Thus, the rules can be applied automatically to transform an element-based functional speci?cation to an array-based speci?cation optimized for an array processor architecture. 5.3.3 Completeness Proof and Normal Form
The DAP transformation strategy propagates generate functions into expressions as far as is possible. This strategy may be viewed as a way of deriving a normal form for generate and reduce, since these functions cannot be driven inde?nitely far into expressions. The existence of a normal form enables the transformations to be applied automatically in the TAMPR system, without the need for human guidance. The basic idea is illustrated by demonstrating how a generate term may be transformed into Array Form. The individual rewrites used in the transformation process are equivalences in the algebra of generate— see Section 5.3.1 — just as the rewrites used in earlier transformation stages are equivalences in the calculus. For simplicity, the discussion concentrates on the propagation rules, and it is assumed that the elements of arrays are scalar values (integers, reals or booleans). Speci?cally, terms are assumed to have the following form: where E is de?ned (using Extended Backus Naur Form) as
generate(S x E)
E :: C j Vj y denotes a tuple of names; N(E) j C denotes a constant; where V denotes an index variable in y; uop E j E bop E j N(E) denotes a function application; and if E then E else E j E and E denote expressions. (( y E )E ) The details of the classes C , V and N are irrelevant in the derivation of the DAP normal form. The application of the transformations can be represented by a recursive tactic T , de?ned by: T(generate(S x E)) = case E of C ! generate(S x C) V ! generate(S x V ) 0 N(E 0 ) ! generate(S x N(E ))
T (generate(S x E )) bop T(generate(S x E )) uop T (generate(S x E 0 )) ( X T(generate(S x:E1 )))T (generate(S x:E2)) else E if ( T (generate(S x E ))) then T (generate(S x E )) else T(generate(S x E )) It is important to note that unary and binary operators (syntactic classes uop and bop) are overloaded: on the
(Rule 1) (Rule 2) (Rule 3) (Rule 4)
1 2 1 2
E bop E uop E 0 (( y E1)E2) if E then E
! ! ! !
X (x) b
left of the rewrites they are applied to individual elements while on the right they are applied to structures. Proposition 1 After transformation, all remaining generate terms have the form tg de?ned by:
tg :: generate(S x C) j generate(S x V ) j generate(S x N(E))
Proof of Proposition 1 De?ne a measure on generating functions; this measure induces an ordering which is used to establish proposition 1 by structural induction. The measure also facilitates a proof of termination of the transformation. The de?nition of is:
The relation < on generating functions is de?ned as E < E 0 (E) < (E 0 ), where < is the usual ‘less than’ relation on natural numbers. The salient point of the ordering on generating functions is that ‘compound’ expressions (unary and binary operator expressions, conditional expressions and -applications) are ‘larger’ than their constituent sub-expressions.
(C) (V ) (N(E)) (uop E) (E bop E ) (if E then E else E ) ((( y E )E ))
=1 =1 =1 = 1+ = 1+ = 1+ = 1+
(E) (E ) + (E ) (E ) + (E ) + (E ) (E ) + (E )
(E) denote the property that all generations occurring in E have the form tg. It is shown that: (T(generate(S x e))) holds for base cases of E (viz. C , V and N(E)), and that if (T (generate(S x e0 ))) holds for all e 0 < e, then (T(generate(S x e))) also holds. Then, by structural induction, (T(generate(S x e))) holds for all e.
Let Base Steps: Consider case C of E . A generation with generating function of this form is left unchanged by T . It is already in the form tg, so (T (generate(S x C))) holds. Similarly for the cases V and N(E). Inductive Steps:
= if E then E else E . (T(generate(S x if E then E else E ))) = (if T (generate(S x E )) then T(generate(S x E )) else T(generate(S x E ))) = (T (generate(S x E ))) ^ (T(generate(S x E ))) ^ (T(generate(S x E ))) = true ^ true ^ true by hypothesis, since E , E , E < E = true
Consider the case E
The cases for unary and binary operators follow similarly. The case more attention:
E = (( x E )E ) requires a little
1 2 2
(T (generate(S x (( y E )E )))) = ((( Z T(generate(S x (E ) ))) T(generate(S x E )))) = (T (generate(S x (E ) ))) ^ (T(generate(S x E )))
1 Z x]
Now the second term in the above, (T (generate(S x E2))), holds by the induction hypothesis, since E2 < E . Since (y) = (Z x]), the substitution of the latter for the former in an expression leaves the measure of the expression exchanged: that is,
y Z x]
) = (E )
y Z x]
Now E1 esis.
< E , so (E )
y Z x]
< E and (T(generate(S x (E )
))) follows by the induction hypoth-
Thus, by structural induction, Proposition 1 holds.
Corollary: It is possible to detect, in the normal form, parallel implementations using rules 5–11.
generate (and reduce) terms which have data-
To illustrate the use of the transformations discussed above, we consider in detail the transformation of part of the function Pot, whose speci?cation is discussed in Section 3.1, The derived implementation of CG is also discussed brie?y.
1. Functional Language Standardization fun Pot:real Array = A:real Array U:real Array B:real Array if (is˙satisfactory(B)) then (U, B) else orthogonalize(mmmult(A, mmmult(U, transform(B))), diagonal(B)) end (mmmult(transpose(U), mmmult(A, U)) ) end end end In?x operators have been converted to applications of functional equivalents and -bindings have been introduced for ML let bindings. 2. Matrix Algebra Optimizations The repeated calculation of matrix evaluated only once.
is recognized and bound to the name AU to ensure it is
fun Pot:real Array = A:real Array U:real Array AU:real Array B:real Array if (is˙satisfactory(B)) then (U, B) else orthogonalize(mmmult(AU, transform(B)), diagonal(B)) end (mmmult(transpose(U), AU)) end (mmmult(A, U)) end end end 3. Unfolding and Simpli?cation fun Pot:real Array = A:real Array U:real Array AU:real Array B:real Array ... end (generate([n, n], [i, j] reduce([n], [k] U[k, i]*AU[k, j], plus, 0.0))) end (generate([n, n], [i, j] reduce([n], [k] A[i, k]*U[k, j], plus, 0.0))) end end end Applications of functions such as mmmult and transpose has been replaced by their de?nitions expressed as generate and reduce operations (see Section 2.2). 4. Array Form by generate
fun Pot:real Array = A:real Array U:real Array AU:real Array B:real Array ... end (reduce([n], [k] generate([n, n], [i, j] U[k, i]*AU[k, j]), plus, 0.0)) end (reduce([n], [k] generate([n, n], [i, j] A[i, k]*U[k, j]), plus, 0.0)) end end end by Element Operator to Array Operator rule 1
; reduce rule 13 -
fun Pot:real Array = A:real Array U:real Array AU:real Array B:real Array ... end (reduce([n], [k] generate([n, n], [i, j] U[k, i])*generate([n, n], [i, j] AU[k, j]), plus, 0.0) ) end (reduce([n], [k] generate([n, n], [i, j] A[i, k])*generate([n, n], [i, j] U[k, j]), plus, 0.0) ) end end end by Expand Special Case rule 7 fun Pot:real Array = A:real Array U:real Array AU:real Array B:real Array ... end (reduce([n], [k] expand˙cols([n], generate([n], [i] U[k, i])) * expand˙rows([n], generate([n], [j] AU[k, j]), plus, 0.0)) ) end (reduce([n], [k] expand˙cols([n], generate([n], [i] A[i, k])) * expand˙rows([n], generate([n], [j] U[k, j]), plus, 0.0)) ) end end end by extraction rule 8
fun Pot:real Array = A:real Array U:real Array AU:real Array B:real Array ... end (reduce([n], [k] expand˙cols([n], row(U, k)) * expand˙rows([n], row(AU, k), plus, 0.0)) ) end (reduce([n], [k] expand˙cols([n], column(A, k)) * expand˙rows([n], row(U, k), plus, 0.0)) ) end end end 5. Common Sub-expression Elimination Common Sub-expression Elimination (CSE) has no effect on the example fragment from POT. Although there is a common element, column(U, k), in the two matrix products it is not ef?cient to make this a common computation in this context. In fact, this operation is implemented using a particular form of DAP addressing so no computation need be performed to create a column of the array U . 6. Functional form to Imperative form subroutine Pot ... block real AU(n, n) do k=1, n, 1 AU = AU+expand˙cols(n, column(A, k)) * expand˙rows([n], row(U, k)) enddo block real B(n, n) do k=1, n, 1 B = B+expand˙cols([n], row(U, k)) * expand˙rows([n], row(AU, k)) enddo ... end end end The reduce operations are translated into loops over the index range. 7. Array Operations to DAP Operations The abstract array operations are converted to the particular (and somewhat arcane) syntax required by Fortran Plus Enhanced.
subroutine Pot ... block real AU(*n, *n) do k=1, n, 1 AU = AU+matc(A( , k), n)*matr(U(k, ), n) enddo block real B(*n, *n) do k=1, n, 1 B = B+matc(U(k, ), n)*matr(AU(k, ), n) enddo ... end end end 8. Fortran Standardization The main section of the Fortran Plus Enhanced implementation of POT is shown in Figure 8.
integer n real tol parameter(tol=1e-15) parameter(n=??) real A(*n, *n), U(*n, *n) real *8 tranU(*n, *n), AU(*n, *n) integer step integer signD(*n, *n) real B(*n, *n), diagB(*n), D(*n, *n) real g611(*n, *n), g598(*n) logical mask(*n, *n) logical g650(*n, *n), g595(*n, *n) . . . continue C
diagB=B(patunitdiag(n), ) g598=abs(diagB) if ((sum(abs(B))-sum(g598))/ . (n*(n-1)).lt.tol) goto 200
Test for convergence
tranU=tran(U) AU=0 do 110 step=1, n AU=AU+matr(A(, step), n)* . matc(tranU(, step), n) 110 continue C
AU := A*U
g595=patlowertri(n) . .and. .not. patunitdiag(n) mask=patlowertri(n) g650=patunitdiag(n) U(mask.and.g650)=1 mask(g650)=.false. g650=g595 D=matr(diagB, n)-matc(diagB, n) signD=1 signD(D.lt.0)=-1 U(mask.and.g650)=(-2*B)/ . (D+signD*sqrt(D*D+4*(B*B))) 126 U(.not.patlowertri(n))=-tran(S) C
U := Transform(B)
B=AU g611=0 do 120 step=1, n g611=g611+matc(tranS(, step), n)* . matr(b(, step), n) 120 continue 125 B=g611
D=0 do 130 step=1, n D=D+matc(AU(step, ), n)* . matr(S(, step), n) 130 continue S=R . . . goto 100 200 continue
Figure 8: Fortran Plus Enhanced implementation of POT The comments have been inserted manually to improve clarity for the reader. The * in the declaration of the matrices indicates that their elements are to be processed in parallel. The iteration of POTiters has been realized by a GOTO loop beginning at line 100 and ending at line 200.
The loop which terminates at line 110 computes the product of matrices A and U . This product is stored since it is used twice: in the computation of B (U T A U ) and in the new eigenvector matrix approximation U (A U transform(B)). The computation of B (the diagonal of which gives the current approximation to the eigenvalues) is completed at line 125.
If B is suf?ciently close to being diagonal (the mean of the absolute values of the off-diagonal elements is suf?ciently close to zero) the loop is exited via the GOTO 200 statement. The following lines, up to line 126, construct the transformation matrix. The de?nition of transform explicitly distinguishes elements in the lower triangle of the transformation matrix from elements in its upper triangle; its implementation on a SIMD architecture thus requires the computation of two matrices (one for lower triangle, one for upper triangle) which are then ‘merged’. However, because the transformation matrix is (ignoring its diagonal) antisymmetric, only one of these matrices need be computed (say, the matrix for the lower triangle); the other matrix can be formed by transposition and negation (as in line 126). Some of the mask manipulation in this part of the computation is unnecessary: no effort has been made to optimize mask expressions since they are very cheap on the DAP. (The grid of single-bit processing elements can manipulate the single-bit representation used for booleans very ef?ciently.) The eigenvector approximation matrix U is updated by the loop terminating at line 130. The orthonormalization of the columns of U is not shown. The FPE implementation of POT is considerably different from its ML speci?cation: the details of the computation of the matrix products and of the transformation matrix would be inaccessible to one unfamiliar with the DAP. The program is ef?cient but it is not easy to read. Of course, it is not intended that the FPE implementation should be read — it is nothing more than a source for processing by the FPE compiler to produce ef?cient machine code for the AMT DAP.
The Fortran Plus Enhanced implementation of CG is shown in Figure 9. The collection of vectors manipulated by the algorithm is realized by four arrays x(*n), etc. The computation of the vectors from which the next approximation is constructed is performed using destructive updates on these arrays; thus there are no separate variables corresponding to x0, etc. The repetition required by the algorithm (expressed using iterate) is implemented using a loop realized by a GOTO occurring at line 13; the loop ends at line 15. Line 2 computes the inner product of r with itself. This value is the measure of the accuracy of the approximation to the solution. If the approximation is suf?ciently accurate, the loop is exited via the GOTO statement on line 4. Otherwise, the next set of values (x0 , r0 , p0 and q0 ) is computed by lines 6 to 12 Note, in particular, that lines 7 and 9 compute the two matrix-vector products: transpose(A)*q A*r0
In the ?rst product, the matrix A is transposed, but no explicit transpose operation occurs in the implementation; rather, row and column operations in the implementation of normal matrix-vector multiplication (i.e. without transposition) are interchanged. This accounts for the slight difference in form between the implementations of the two products. 25
- sumr(A*matc(q, n)) - sumc(A*matr(r1, n))
real*8 A(*n, *n), x(*n), r(*n), p(*n) real*8 q(*n), b(*n), r1(*n), beta, rr integer cnt . . . 1 continue 2 rr = sum(r*r) 3 if (sqrt(rr).lt. 1.0E-14) then 4 goto 15 5 else 6 alpha = rr/sum(q*q) 7 r1 = r-sumr(A*matc(q, n))*alpha 8 beta = sum(r1*r1)/rr 9 q = sumc(A*matr(r1, n))+q*beta 10 x = x+p*b 11 r = r1 12 p = r1+p*beta 13 goto 1 14 endif 15 continue . . . Figure 9: Fortran Plus Enhanced implementation of conjugate gradient Again, the DAP implementation may appear rather ugly since it is not intended for a human reader. The program is, however, an extremely ef?cient implementation that exploits the strengths (and indeed quirks) of the DAP architecture. The implementation makes effective use of the DAP hardware, with all of the vector and matrix operations being performed in fully data-parallel manner. The only unsatisfactory aspect of the implementation is the unnecessary use of the variable r1: the assignment to r1 in line 11 could be replaced with an assignment to r, obviating the need to assign to r later. Ef?ciency could be improved by eliminating two vector assignments and one vector variable.
Execution Performance of Derived Implementations
From the point of view of a user of an implementation, its most important feature (after correctness) is its execution speed. Clear, extensible functional speci?cations are useful only if it is possible to derive fast and ef?cient implementations from them. Examination of the derived implementations reveals that they are highly ef?cient — they make excellent use of the parallel processing capabilities of the DAP. A more rigorous assessment of the execution performance of the derived POT implementation can be made by comparing it with that of a hand-crafted implementation developed independently by a programmer who was very familiar with the target architecture. In Figure 10 the time required to compute one approximation to the eigensystem 6 by a hand-crafted implementation of POT is compared with the time required by the automatically derived implementation; the hand-crafted version has been analyzed in [21, 70]. The execution times for the hand-crafted and automatically derived versions are almost identical. For the larger matrix examples the derived implementation is marginally slower than the hand crafted version
6 The same amount of time is required to generate each successive approximation.
64 128 256
Time per iteration (sec) Hand Crafted Automatically Derived Fortran Plus Enhanced Fortran Plus Enhanced 1.35 1.35 9.30 9.31 69.86 70.30
Figure 10: Execution times of derived and manually-constructed DAP implementations of POT (by between 0.1% and 0.6%). This discrepancy arises from a minor optimization made possible by the particular way in which the hand-crafted version implements the transform operation.
The work presented in this paper addresses many different themes in computing science. Thus it is impossible to provide an exhaustive survey of related work. However, the work described here treats the broad themes of language selection for algorithm speci?cation, functional language compilation and program transformation.
The primitive array functions, generate and reduce (see section 2) used in describing computations should be familiar to those with experience of functional programming languages. The de?nitions presented are, for the most part, natural extensions of the usual de?nitions over lists to de?nitions over arrays. (Those unfamiliar with functional programming languages may consult [44, 7, 30, 63, 67, 26] for an introduction to functional programming and the use of higher-order functions.) No claim is made in respect to the originality of the array functions; they are presented as objects that have proved to be particularly useful in the speci?cation of numerical mathematical algorithms and in the formal manipulation of such speci?cations. Maa?en  proposes data structures and higher order functions over them for the parallel execution of functional programs. The functions employed in the functional speci?cations in this paper are related to these de?nitions. Darlington et al.  use skeletons [28, 23, 61] in high-level speci?cations of algorithms. Skeletons are higher-order functions that describe a repertoire of parallel operations and are used as the building blocks of an algorithm’s speci?cation. Skeletons are intended to separate the meaning of the computation from any tailored parallel form which may be derived from such de?nitions. The primitive functional forms used here may be regarded as simple skeletons in that they may be interpreted as indicating data-parallel execution. In [28, 29] other skeletons are de?ned that are oriented to particular computational models; for example, processor-pipeline and processor-farm skeletons are de?ned. This type of skeleton may be viewed as de?ning an execution model which is suitable for carrying out a particular computation. This approach to algorithm speci?cation is different from the one adopted here; in this paper, it is proposed that a speci?cation should be as free from execution detail as possible — the algorithm speci?cation de?nes only the functions to be implemented and relegates the decisions as to implementation to the transformation phase. It is clear that automatic tools (such as the transformation system suggested here) could not supersede the r? le of the expert programmer; nevertheless, it is interesting to explore how much can be achieved autoo matically. With TAMPR it is possible to apply particular algorithm transformations to achieve the effect of model-oriented skeletons.
The Bird Meertens Formalism (BMF) [65, 5, 6, 56, 3, 52] provides a simple, consistent functional language in which algorithms may be expressed. BMF provides an elegant framework for the study of algorithms, but its utility as a numerical mathematical algorithm speci?cation language is problematical given its list-based approach. The array is fundamental to the natural expression of a large body of numerical mathematical algorithms and to their ef?cient implementation. We contend that, for most numerical mathematical algorithms a functional speci?cation that uses lists to represent arrays is unnatural — for example, consider expressing a basic operation such as matrix transpose using a list representation. Moreover, such list-based speci?cations are unlikely to be amenable to the utilization of the optimization techniques and implementation strategies developed by implementers of numerical mathematical algorithms. This corpus of implementation experience is essential for ef?cient implementation of functional, numerical mathematical speci?cations and thus for the acceptance of functional programming languages for this purpose. Numerical mathematicians readily accept array-based functions as a natural extension to the conventional mathematical notations used in their community. Hains and Mullin  de?ne ML functions that operate on arrays. The dimensionality of the array is expressed by de?ning the structure of the array. However, as with BMF, arrays are represented by lists of elements thereby reducing the readability of speci?cations and impairing its usability for those to whom the work reported here is particularly addressed.
Functional language Compilers
Many functional language compilers generate machine code which is comparable in ef?ciency to that produced from hand-crafted imperative programs; among these are the Orbit Compiler  for the language T, the ALFL language compiler , the compiler for the SISAL language  and the Lazy ML compiler . This body of experience has been drawn upon in the compiler-oriented transformations of the transformational derivations presented here. Many computer systems have been developed speci?cally to support the parallel execution of functional programming languages [24, 40, 45, 64, 48, 38, 53]. Special hardware that supports combinatoric graph reduction offers the possibility of a radical change in the relative performances of functional and imperative languages, thereby reducing the need for the construction of an imperative implementation of a functional speci?cation. Simon Peyton-Jones  gives an excellent survey and description of combinatoric graph reduction. Although attractive in principle, very few special-purpose graph reduction machines have been constructed and none is widely available. Even if a successful graph-reduction machine were built and could yield execution performance comparable to that achieved by procedural programs executed on conventional von Neumann architectures, such a machine is unlikely to be a cost-effective alternative to mass-produced conventional machines. A number of functional languages have been extended to include parallel evaluation primitives. Typically, when using such languages, a programmer speci?es that a process be created to evaluate some expression and evaluation then proceeds until the value generated by the created process is required [33, 55, 37, 49, 34]. Such language forms might serve as a target for transformation derivation or as a standard form to be used in the transformational process. As before, however, our goal is to have speci?cations that are free of execution detail.
A large volume of literature on program transformations and derivations is available. Although it is not the main subject of this paper the interested reader is referred to Partsch  for an overview of various transformations systems and to [25, 27, 72, 66, 42, 47] for discussions of particular transformation systems. A major issue still to be addressed in transformation systems is the control of the derivation process; i.e. the speci?cation of strategies to achieve some goal. The approach advocated here is to de?ne a sequence of normal forms that achieves a goal (the conversion from some initial form to some ?nal form): consideration 28
of strategy is then reduced to ensuring that the transformations convert one normal form into the next. The use of normal forms has been discussed at least as early as 1970 by Boyle  and has been addressed more recently by Hoare . In a recent paper, Boyle  shows how a sequence of normal forms can be used to control transformations that perform partial evaluation of programs. Program transformation has traditionally been used to recast a program into an equivalent but more ef?cient form. The initial and ?nal forms are generally expressed in the same language. An early example is Burstall’s and Darlington’s unfold-fold transformations  which improve the execution ef?ciency of systems of recursive equations. This topic is pursued further in [9, 39, 46, 62]. Again, the work reported in these papers has been employed in the optimization techniques used in the unfolding phase of the transformational method discussed here.
Traditional Imperative Parallel Programming
The majority of programs that are executed by parallel computers are expressed in Fortran, a language which is inherently sequential. Fortran compilers which generate code for parallel systems usually perform extensive program analysis in order to exploit parallel execution. This is achieved primarily by executing multiple iterations of DO loops simultaneously . This area of study is not directly related to the work in this paper, insofar as the results of research is this area are not employed in the derivations presented here. However, the research is important because Fortran is currently the only feasible language for programming many vector and parallel computer systems (and that has consequences for derivations). The intractability of many of the problems that arise in vectorization or parallelization is a major factor motivating research into alternative approaches to programming high-performance computer systems; the work reported here may be viewed as one alternative. Con?guration languages such as those advocated in [31, 51] permit composition of black-box processes by speci?cation of the communication between these. Typically, the processes are expressed in a sequential language, such as Fortran or C, and the communication is by reading and writing to communication ports. Con?guration languages normally require too low a level of detail to be suitable for specifying algorithms, but they might be suitable as target languages for derivations. Imperative languages have been extended to include parallel programming constructs. The extensions range from subroutine libraries, that are little more than interfaces to operating system routines, to entirely new languages such as ADA, which are designed with parallel execution in mind. Of particular relevance in the context of this paper are the array extensions to Fortran provided by languages such as Fortran90 , Connection Machine Fortran , Fortran Plus Enhanced , Fortran-D  and Vienna Fortran . These extensions provide, to some degree, a data abstraction for arrays: many common operations such as the elementwise addition of two arrays are provided as pure functions (denoted by the usual ‘+’ operator). Vienna Fortran is distinguished from the others by its advanced support for data templates, which permit the programmer to de?ne the distribution of data on distributed memory systems. Recently, many of the features of these array-based Fortran dialects have been coalesced into a single language called High Performance Fortran . The language de?nition is still under review and there are, as yet, no widely available HPF compilers. In some ways, the array extensions to Fortran may be viewed as an attempt to introduce into Fortran some of the features of functional languages: expressions permit array operations to be denoted in a high-level, machine-independent manner that allows operations to be succinctly combined and that facilitates analysis. It is thus natural to enquire whether the wide-spread use of array-based Fortran would render irrelevant the work reported in this paper, since programmers would have available array operations that are almost the same as those provided by the array data abstraction used here. We offer the following reasons for replying in the negative: The array-based Fortran dialects fall short of providing complete data abstractions for matrices and vectors; for example, they do not support common linear algebraic operations such as matrix product.
Some of the Fortran dialects do provide module mechanisms for hiding implementation details, but, in general, ef?ciency considerations will probably force programmers to continue using subroutines as their main (if any) decomposition mechanism. What the derivational approach offers over Fortran in any form is a clear separation of the tasks of specifying an algorithm and implementing it. The expression-based array operations are likely to impose just as high overheads on Fortran implementations as on functional implementations. The developers of compilers for the Fortran dialects will have to address issues such as destructively updating arrays, but they will have to address the issue in the context of an already complex compilation system. The derivational approach allows implementation issues to be separated and addressed more methodically. Thus, the chief relevance of the array-based dialects of Fortran for the derivational approach proposed here will probably result from their use as programming models to replace the ill-de?ned model provided by Fortran77. (It should be easier to derive implementations designed for parallel execution using HPF as the implementation language rather than Fortran77.)
In this paper we demonstrated that it is possible mechanically to transform high-level functional speci?cations into highly ef?cient implementations tailored for execution on the AMT DAP array processor. The functional speci?cations are not biased in ways that guarantee ef?ciency of their implementations on a particular machine architecture; rather, they are expressed in ways that provide clear statements of algorithms. Indeed, the example speci?cations may be used as starting points for deriving similarly ef?cient implementations tailored for execution on other machines. The transformations used to produce the implementations presented in this paper are problem independent and may be applied to ML speci?cations of other algorithms. The method may be further re?ned by tailoring the generated code for a particular compiler (for example, producing sectioned Fortran Plus array operations that are tailored for the size of the processor array) and de?ning specialized data transformations (for example, speci?c transformations for sparse matrices).
 AMT, Fortran-Plus Language Enhanced, man 102.01, 1988.  L. Augustsson and T. Johnsson, The Chalmers Lazy-ML Compiler, Computer Journal, 32(2), pp 127141, 1989.  Roland Backhouse, An Exploration of the Bird-Meertens Formalism, Technical report CS8810, Department of Mathematics and Computing Science, University of Groningen, 1988 ?URL:ftp://ftp.win.tue.nl/pub/math.prog.construction/exploration.dvi.Z?.  S. Benker, et al., Vienna Fortran-90, in Proceedings of the Scalable High-Performance Computing Conference, eds R. Voigt and J. Saltz, IEEE Computer Society, 1992, pp 51-59.  R.S.Bird, Algebraic Identities for Program Calculation, Computer Journal, 32(2), pp 122-126, 1989.  R.S.Bird and J.Hughes, The Alpha-Beta Algorithm: An Exercise in Program Transformation, Information Processing Letters, 24, pp 53-57, 1987.  R.S. Bird and P. Wadler, Introduction to Functional Programming, Prentice Hall, 1988.  A. Bloss, P. Hudak and J. Young, An Optimizing Compiler for a Modern Functional Language, Computer Journal, 32(2), 1989, 152-161.
 E. A. Boiten, Improving Recursive Functions by Inverting the Order of Evaluation, Science of Computer Programming, 18(2), 1992, pp 139-179.  J. M. Boyle, A Transformational Component for Programming Language Grammar, ANL-7690 Argonne National Laboratory, July 1970, Argonne, Illinois.  J. M. Boyle, Program Adaptation and Program Transformation, in Practice in Software Adaptation and Maintenance, ed R. Ebert, J. Lueger and L. Goecke, North-Holland Publishing Co., Amsterdam 1980, pp 3-20.  J. M. Boyle and M. N. Muralidharan, Program reusability through program transformation, IEEE Transactions on Software Engineering, Vol. SE-10, No. 5, 1984, pp 574-588  J. M. Boyle and T.J. Harmer, Functional Speci?cations for Mathematical Computations, Constructing Programs from Speci?cations, Proceedings of the IFIP TC2/WG2.1 Working Conference on Constructing Programs from Speci?cations, Paci?c Grove, CA, USA, 13-16 May, 1991, B. M¨ ller, Ed., o North-Holland, Amsterdam, 1991, pp 205-224  James M. Boyle, Abstract programming and program transformations—An approach to reusing programs. in Software Reusability, Volume I, ed. Ted J. Biggerstaff and Alan J. Perlis ACM Press (Addison-Wesley Publishing Company), New York, NY, 1989, pp 361-413.  J. M. Boyle and T. J. Harmer, A Practical Functional Program for the CRAY X-MP, Journal of Functional Programming, Vol. 2, No. 1, 1992, pp 81-126.  J.M. Boyle, Towards Automatic Synthesis of Linear Algebra Programs Production and Assessment of Numerical Software, M. A. Hennell and L. M. Delves, eds., Academic Press, 1980, pp 223-245.  James M. Boyle, Automatic, Self-adaptive Control of Unfold Transformations, presented at PROCOMET ’94, IFIP Working Conference on Programming Concepts, Methods and Calculi, San Miniato, Italy, June 6-10, 1994, proceedings to be published by North-Holland/Elsevier.  R. M. Burstall and J. Darlington, A Transformation System for Developing Recursive Programs, ACM Journal, 24(1), 1977, pp 44-67.  M. Clint, et al., Towards the construction of an eigenvalue engine, Parallel Computing, 8, 1988, pp 127-132.  M. Clint, et al., A Comparison of two Parallel Algorithms for the Symmetric Eigenproblem, Intern’l Journal of Computer Mathematics, 15, 1984, pp 291-302.  M. Clint and J. S. Weston and C. W. Bleakney, Comparison of Parallel Fortran Environments on the AMT DAP 510 for a Linear Algebra Application, Concurrency: Practice and Experience, 6(3), May 1994, pp 193–204.  CM Fortran Reference Manual, TMC, Thinking Machines Corporation, Cambridge, MA, USA, 1991, October  M Cole, Algorithmic Skeletons: Structured Management of Parallel Computation, MIT Press 1989.  A. Contessa, et al., MaRS, a Combinator Graph Reduction Multiprocessor, in PARLE ’89 Parallel Architectures and Languages Europe, I, eds E. Odijk, M. Rem and J. C. Syre, LNCS 365, SpringerVerlag, 1989, pp 176-192.  M. S. Feather, A System for Assisting Program Transformation, ACM Programming languages, 4(1), 1982, pp 1-20.  J.T. Feo, D. Cann, R.R. Oldehoeft, A Report on the SISAL Language Project, Journal of Parallel and Distributed Computing, v. 10, no. 4, December 1990, pp 349–366.  J. Darlington, et al., A Functional Programming Environment Supporting Execution, Partial Execution and Transformation, in PARLE ’89 Parallel Architectures and Languages Europe, I, eds E. Odijk and M. Rem and J. C. Syre, LNCS 365, Springer-Verlag, pp 286-305. 31
 J.Darlington, et al., Parallel Programming Using Skeletons, in PARLE’93: Parallel Architectures and Languages Europe, LNCS 694, eds Arndt Bode and Mike Reeve and Gottfried Wol, Springer-Verlag, June 1993.  J.Darlington, Yi-ke Guo and Hing Wing To, Structured Parallel Programming: Theory Meets Practice, Draft paper obtained from the ?rst author, January 1995.  A.J. Field and P.G. Harrison, Functional Programming, Addison Wesley, 1988.  I. Foster, et al., Productive Parallel Programming: The PCN Approach, Scienti?c Programming, 1(1), 1992, pp 51–66.  G. Fox, et al., Fortran D Language Speci?cation, Research Report, Rice University, January 1992.  R. Gabriel and J.McCarthy, QLISP, in Parallel Computing and Computers for AI, Kluwer Academic, pp 63-89.  A. Giacalone, et al., Facile: A Symmetric Integration of Concurrent and Functional Programming, Journal of Parallel Computing, 18(2), 1989, pp 121-160.  A. Gill, et al. A Short Cut to Deforestation, in FPCA, 1993, ?URL:ftp://ftp.dcs.gla.ac.uk/pub/glasgow-fp/papers/deforestation-short-cut.ps.Z?  G.Hains amd L.M.R. Mullin, Parallel Functional Programming with Arrays, The Computer Journal, 36(3), 1993, pp 238-245.  R.H. Halstead, Parallel Computing Using Multilisp, in Parallel Computation and Computers for AI, ed J.S. Kowalik, Kluwer Academic, pp 21-40.  P. G. Harrison and M. Reeve, The Parallel Graph Reductions Machine ALICE, in Graph Reduction, eds J. H. Fasel and R. M. Keller, LNCS 279, Springer-Verlag, 1986, pp 181-202.  P. G. Harrison and H. Khoshnevisan, Algebraic Transformation Techniques for Functional Languages, Computer Journal, 31(3), 1988, pp 229-242.  L. O. Hertzberger and W. G. Vree, A Coarse Grain Parallel Architecture for Functional Languages, in PARLE ’89 Parallel Architectures and Languages Europe, I, eds E. Odijk, M. Rem and J. C. Syre, LNCS 365, 1989, Springer-Verlag, pp 269-285.  High Performance Fortran Language Speci?cation, Version 1.1 High Performance Fortran Forum, Nov 10 1994, ?URL:http://www.hensa.ac.uk/parallel/documents/hpf/hpf-v11.ps.gz?  D. Hildum and J. Cohen, A Language for Specifying Program Transformations, IEEE Transactions on Software Engineering, 16(6), 1990, pp 630-638.  C. A. R. Hoare, et al., Normal Form Approach to Compiler Design, Acta Informatica, 30(8), 1993, pp 701-739.  J. Hughes, Why Functional Programming Matters, The Computer Journal, Vol. 32, No. 2, 1989, p. 98.  S. Hwang and D. Rushall, The nu-STG machine: a parallelized Spineless Tagless Graph Reduction Machine in a distributed memory architecture, in Proceedings of the 4th Int. Workshop on the Parallel Implementation of Functional Languages, 1992, ?URL:ftp://ftp.informatik.rwth-aachen.de/pub/reports/1992/92-19.dir/92-19-21.ps.gz?  O. Kaser, et al., On the Conversion of Indirect to Direct Recursion, ACM Letters on Programming Languages and Systems, 2, 1993, pp 151-164.  E. W. Karlsen, et al., The PROSPECTRA System: A Uni?ed Development Framework, in Algebraic Methodology and Software Technology (AMAST ’91), eds M. Nivat, C. Rattray, T. Rus and G. Scollo, Springer-Verlag, 1991, pp 421-433.
 J.A. Keane, An Overview of the Flagship System, Journal of Functional Programming, 4(1), 1994, pp 19-45.  D.A.Krantz, R.H Halstead and E Mohr, MulT: A High-Performance Parallel Lisp, ACM SIGPLAN Notices, 24(7), 1989, pp 81-90.  A. Maa?en, Parallel Programming with Data Structures and Higher Order Functions, Science of Programming, 18, 1992, pp 1-38.  J. Magee, et al., An Overview of the REX Software Architecture, 2nd IEEE Computer Society Workshop on Future Trends of Distributed Computing Systems, 1990.  G. Malcolm, Data Structures and Program Transformations, Science of Computer Programming, 14, 1990, pp 255-279  Shogo Matsui, et al., SYNAPSE: A Multi-microprocessor Lisp Machine with Parallel Garbage Collector, in Parallel Algorithms and Architectures, LNCS 269, Springer-Verlag, 1987, pp 131-137.  C. McCrosky, Intermediate Container Removal, Computer Languages, 16(2), 1991, pp 179-195.  P. F. McGehearty and E. J. Krall, Potentials for Parallel Execution of Common Lisp Programs, in Proceedings of the 1986 International Conference on Parallel Processing, eds K. Hwang, S. M. Jacobs and E. E. Swartzlander, 1986, pp 696-702.  L. Meertens, Constructing a Calculus of Programs, in Mathematics of Program Construction, LNCS 375, Springer-Verlag, 1989, pp 66-90.  Metcalf, Michael and Reid, John, Fortran 90 Explained, Oxford University Press, Oxford Science Publications, 1990, ISBN 0-19-853772-7.  Jagdish J. Modi, Parallel Algorithms and Matrix Computations, Oxford University Press, 1988, ISBN 0-19-859670-7  Massively Parallel Computing with the DAP, Parkinson, Dennis and Litt, John (Eds.), Research Monographs in Parallel and Distributed Computing, The MIT Press, 1990, ISBN 0-273-08809-2  H. Partsch and R. Steinbr¨ ggen, Program Transformation Systems, ACM Computing Surveys, 15(3), u 1983, pp 199-236.  S. Pelagatti, A Methodology for the Development and Support of Massively Parallel Programs, PhD Thesis, Universita Delgi Studi di Pisa, 1993.  A. Pettorossi and R. M. Burstall, Deriving very Ef?cient Algorithms for Evaluating Linear Recurrence Relations using the Program Transformation Technique, Acta Informatica 18, 1982, pp 181206.  S.L. Peyton-Jones, The Implementation of Functional Programming languages, Prentice-Hall, New York, 1987.  S.L. Peyton Jones, et al., High-Performance Parallel Graph Reduction, in PARLE Parallel Architectures and Languages Europe, I, eds E. Odijk and J.-C. Syre, LNCS 365, Springer-Verlag, pp 193.  D.B. Skillicorn, Architecture-Independent Parallel Computation, IEEE Computer, 23(12), pp 38-50, 1990.  D. R. Smith, KIDS: A Semiautomatic Program Development System, IEEE Transctions on Software Engineering, 16(9), pp 1024-1043.  GL Steele, Common Lisp, Digital Press, 1986.  P. Wadler, Deforestation: Transforming Programs to Eliminate Trees, Journal of Theoretical Computer Science, 73, 1990, pp 231-248.  A. Wilstr¨ m, Functional Programming using Standard ML, Prentice Hall, London 1987. o
 J. Weston and M. Clint, Two Algorithms for the Parallel Computation of Eigenvalues and Eigenvectors of Large Symmetric Matrices using the ICL DAP, Parallel Computing, 13, 281-288, 1990.  J. Weston, et al., The Parallel Computation of Eigenvalues and Eigenvectors of Large Hermitian Matrices using the AMT DAP 510, Concurrency: Practice and Experience, Vol 3(3), 179-185, June 1991.  J. A. Yang and Young-il Choo, Parallel-Program Transformation using a Metalanguage, ACM SIGPLAN Notices 26(7), 1991, pp 11–20.  H. Zima and B. Chapman, Supercompilers for Parallel and Vector Computers, Frontier Series, ACM Press, 1990, ISBN 0-201-17560-6.