|
Targets: VMX (JS20, Cell PPE), SPE (Cell),
BlueGene/L (Double Hummer)
The SIMD vectorization (a.k.a. simdization)
is implemented in IBM's XL product compiler, and thus supports multiple
programming languages (e.g., C, C++, and Fortran), and multiple target
machines (e.g., SPU, VMX, and BG/L).
As shown below, successful simdization involves several
tasks that closely interact with each other. First Single Instruction
Multiple Data (SIMD) parallelism must be extracted from an application,
which may involve extraction from a basic-block, a loop, or a combination
of both. Once that parallelism is extracted, various additional SIMD hardware
constraints must be satisfied, such as memory alignment requirements,
physical vector length, and hardware instruction set.

Our approach uses a virtual vector representation [ICS05]
as an initial model to successfully extract SIMD parallelism from multiple
sources in the code. Virtual registers have no alignment constraint and
have arbitrary lengths. The virtual vectors are then progressively lowered
in level to the physical SIMD registers of the target machine in successive
de-virtualization steps.
A summary of simdization features:
- Loop-level simdization
- Loops with induction/reduction/local variables, invariant computation,
stride-one array accesses
- Loops with small constant trip count
- Loops with mixed data types
- Basic-block level simdization
- Unrolled loops
- Computation on adjacent members of aggregates
- Scalar packing
- Quasi-isomorphic packing
- Aggregate copy statement
- Alignment handling
- Data realignment on the fly (only when data are relatively misaligned)
- Minimizing data realignment overhead
- Versioning for relative alignment (for BG/L)
- SIMD prologue/epilogue for misaligned stores
- Scalar prologue/epilogue for misaligned stores to avoid false
sharing and for memory protection
- Other optimizations
- Diagnostic information on whether loops are simdizable and reasons
for not being simdized
- Interprocedure alignment analysis
- Idiom recognition to exploit special hardware instructions
- Loop distribution to isolate simdizable statements
- Array recovery for loops with pointer arithmetics
- Runtime dependence analysis
- And a lot more to come ....
Examples of code extract handled by our frameork are found here.
Embedded in the Toronto Portable Optimizer (TPO) component of the XL
compiler, the simdization optimization leverages a rich set of high-level
optimizations such as interprocedure analysis and loop transformations.
Here we focus on describing the new components that are introduced specifically
for simdization purposes. There are 6 new phases. The first 3 phases are
mainly concerned with extracting SIMD parallelism from various code structures,
e.g., basic blocks or loops.
Phase 1: Basic-Block Level Aggregation. During
this phase, we aggregate isomorphic statements that address consecutive
memory locations into virtual vectors. This phase is particularly successful
at extracting SIMD parallelism in unrolled loops and consecutive elements
of structs, such as 3-dimensional coordinates found in multimedia applications.
It also handles semi-isomorphic statements, for example statements that
are nearly identical but for an add versus a subtract operation.
Phase 2: Short-Loop Aggregation. During this
phase, we aggregate entire loops with small trip counts into virtual vectors.
This approach is successful in eliminating short innermost loop and may
thus enable further SIMD aggregation within the second innermost loop.
Typical examples are FIR based algorithms.
Phase 3: Loop-Level Aggregation. During this
phase, we aggregate instances of a statement in consecutive iterations
of a loop. This phase is particularly successful at extracting SIMD parallelism
in loops and recognizing special patterns such reductions and linear recurrences.
The blocking factor is selected so that the length of each virtual vector
is a multiple of the physical vector length: 16 bytes in the VMX/Cell/BGL
architectures.
These first 3 phases proceed regardless of the alignments and lengths
of the vectors, as if targeting an idealized SIMD machine. This simplifies
the extraction phases greatly, especially in the presence of mixed data
sizes. Each phase extracts SIMD parallelism, treating input scalar or
virtual vectors (generated by a prior phase) in a similar fashion.
The next 3 phases deal with specific constraints of SIMD units, such
alignment, physical vector length, and SIMD instruction sets.
Phase 4: Alignment Devirtualization. During this
phase, we satisfy the alignment constraint which states that the data
being logically operated on by a SIMD operation must reside in the same
slot of their respective input and output SIMD registers. In other words,
when adding a[i] and b[i+1] in SIMD fashion, we must ensure
that both a[0] and b[1] values are in the same relative
position in their respective register. Data realignment instructions are
inserted only for relatively misaligned memory streams. We attempt to
minimize the number of data realignment instructions, both in the presence
of compile-time [PLDI04] and runtime
[CGO05] misalignment.
More information on the impact of alignment and our approach to minimize
the impact of misalignment is found here.
After the completion of this phase, all virtual vector accesses are to
aligned data regardless of the compile time/runtime nature of their alignment.
A virtual vector may still have a virtual length, i.e. a length that is
too long with respect to the physical vector registers provided by the
underlying architecture.
Phase 5: Length Devirtualization. During
this phase, long vectors are chunked into physical length vector registers,
or they may revert back to scalar statements. This phase also handles
data size conversion [CGO05], e.g.,
generating twice as many instructions for 4-byte integer expressions that
use results generated by 2-byte short integer expressions.
After this phase, all virtual vectors are aligned and have the physical
vector length, namely they are compatible with the physical vector registers
provided by the underlying SIMD architecture.
Phase 6: SIMD Code Generation. Up to this
point, most SIMD instructions were generic operators such as "add,"
"mult," or primitive data reorganization instructions. During this phase,
we generate SIMD instructions that are specific to the target SIMD units.
After this last phase, the simdization process is complete and the SIMD
operations can be safely scheduled and register allocated.

|