|
- We present here a performance evaluation of our compiler technology
for Cell.
Figure 1 first shows the performance improvement obtained by optimized
SPE code generation.

Figure 1. Performance improvement obtained by optimizing for the SPE.
Figure 1 presents the reduction in program execution time for each optimization,
relative to the performance of the original compiler (standard optimizations
at O3 level, scheduled for the SPE resource and latency model). We report
an average reduction of 22%, ranging from 11% to 51%. The first gain, in
green, corresponds to the benefit of bundling instructions for dual issue.
The second gain, in yellow, corresponds to the additional benefit of compiler-assisted
branch hints. The third gain, in orange, corresponds to the additional
benefit of compiler-managed instruction fetch. Its impact is most notable
for benchmarks with high memory bandwidth demands, such as matrix multiply and saxpy.
The last gain, in red, corresponds to the additional benefit of handling
subword computation in the backend of the compiler.
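
Figure 1 reports reductions in execution time, whereas Figures 2 and 3 report speedup factors. To compare the two, a reduction r in execution time corresponds to a speedup of 1 / (1 - r). The short sketch below only illustrates that arithmetic; the function name is hypothetical and is not part of the compiler.

```python
def speedup_from_reduction(reduction):
    """Convert a fractional reduction in execution time into a speedup factor.

    A reduction of r means the new time is (1 - r) times the old time,
    so the speedup is old_time / new_time = 1 / (1 - r).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


# The reductions quoted for Figure 1, expressed as speedup factors.
for r in (0.11, 0.22, 0.51):
    print(f"{r:.0%} less time -> {speedup_from_reduction(r):.2f}x faster")
```

By this conversion, reductions of 11%, 22%, and 51% correspond to speedups of roughly 1.12, 1.28, and 2.04. Note that converting the average reduction this way does not, in general, give the average of the per-benchmark speedups.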
Performance improvement obtained by automatic
SIMD code generation for the SPE is shown in Figure 2.

Figure 2. Performance improvement due to automatic SIMD code generation.
Figure 2 presents the speedup factors achieved when automatically simdizing
sequential code kernels. Comparisons are performed at the same level of
optimization, including high-level, interprocedural optimizations in addition
to all of the SPE optimizations reported in Figure 1. We report an average
speedup factor of 9.9, ranging from 2.4 to 26.2.
Note that the speedup can be super-linear, because scalar code on the
SPE may require a read-modify-write sequence in order to write one byte
of data. This overhead is avoided in successfully simdized code.
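
The read-modify-write effect can be illustrated with a toy model. The sketch below is not SPE code and does not reproduce the measurements in Figure 2. It assumes a memory that can only be read or written in aligned 16-byte quadwords and counts the accesses a byte-wise kernel needs in scalar and in simdized form. All class and function names are hypothetical.

```python
QUADWORD = 16  # bytes per memory access in this toy model


class LocalStore:
    """Toy memory that only supports aligned, whole-quadword loads and stores."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.loads = 0
        self.stores = 0

    def load_qw(self, addr):
        assert addr % QUADWORD == 0
        self.loads += 1
        return bytearray(self.data[addr:addr + QUADWORD])

    def store_qw(self, addr, qw):
        assert addr % QUADWORD == 0 and len(qw) == QUADWORD
        self.stores += 1
        self.data[addr:addr + QUADWORD] = qw


def load_byte(ls, addr):
    # A scalar byte load still fetches the whole enclosing quadword.
    base = addr - addr % QUADWORD
    return ls.load_qw(base)[addr - base]


def store_byte(ls, addr, value):
    # A scalar byte store is a read-modify-write of the enclosing quadword.
    base = addr - addr % QUADWORD
    qw = ls.load_qw(base)      # read
    qw[addr - base] = value    # modify
    ls.store_qw(base, qw)      # write


def increment_scalar(ls, src, dst, n):
    # One byte at a time: dst[i] = src[i] + 1 (mod 256).
    for i in range(n):
        store_byte(ls, dst + i, (load_byte(ls, src + i) + 1) & 0xFF)


def increment_simd(ls, src, dst, n):
    # Sixteen bytes at a time: one load and one store per quadword.
    assert src % QUADWORD == 0 and dst % QUADWORD == 0 and n % QUADWORD == 0
    for off in range(0, n, QUADWORD):
        qw = ls.load_qw(src + off)
        ls.store_qw(dst + off, bytearray((b + 1) & 0xFF for b in qw))


def count_accesses(kernel, n=64):
    # Run a kernel on n bytes and return (memory accesses, output bytes).
    ls = LocalStore(2 * n)
    ls.data[0:n] = bytes(range(n))
    kernel(ls, 0, n, n)
    return ls.loads + ls.stores, bytes(ls.data[n:2 * n])
```

For 64 bytes, the scalar version in this model issues 192 quadword accesses and the simdized version issues 8, a ratio of 24 even though a quadword holds only 16 bytes. The model counts memory accesses only and ignores the extra instructions needed to extract and insert the byte, so it is indicative at best.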
Performance improvement obtained using automatic
work distribution is shown in Figure 3.

Figure 3. Performance improvement due to automatic work distribution.
Figure 3 presents the speedup in execution time obtained when the compiler
automatically distributes work among the PPE and 8 SPE cores. The baseline
for the comparison is the execution time on the PPE alone, using code
compiled at the highest level of optimization. The blue bars show speedups
obtained when we only use the software cache mechanism to transfer all
application data in and out of the SPE local stores. The red bars show
speedups obtained when some of these transfers are optimized to use static
buffering, instead of being cached on demand. For the two benchmarks Swim
and Mgrid from the SPEC OpenMP 2001 suite, we report an average speedup
factor of 7.7. The improvement from static buffering stems from eliminating software
cache lookup overhead, transferring more data per DMA operation, and overlapping
DMA transfers with SPE computation.
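
The three effects named above can be made concrete with a back-of-the-envelope model. The sketch below is a simplification under stated assumptions, not the compiler's implementation: it assumes a sequential traversal of the data, a cold software cache that fetches each line once, and a single DMA engine feeding two buffers. The function names and all parameter values (element size, cache line size, buffer size, per-chunk costs) are hypothetical.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def transfer_counts(n_bytes, elem_size, line_size, buffer_size):
    """Count lookups and DMA operations for one sequential pass over n_bytes.

    Software cache: every element access performs a lookup, and each
    cache line is fetched with its own DMA operation.
    Static buffering: no per-access lookup, and each DMA operation
    moves a whole buffer.
    """
    cached = {
        "lookups": ceil_div(n_bytes, elem_size),
        "dma_ops": ceil_div(n_bytes, line_size),
    }
    buffered = {
        "lookups": 0,
        "dma_ops": ceil_div(n_bytes, buffer_size),
    }
    return cached, buffered


def serial_time(n_chunks, dma, compute):
    """Total time when each chunk is transferred, then processed, in turn."""
    return n_chunks * (dma + compute)


def double_buffered_time(n_chunks, dma, compute):
    """Total time when the transfer of chunk i+1 overlaps the processing of chunk i.

    The first transfer and the last computation cannot be overlapped;
    every step in between costs the slower of the two activities.
    """
    if n_chunks == 0:
        return 0
    return dma + (n_chunks - 1) * max(dma, compute) + compute


# Example with hypothetical sizes: 64 KB of 4-byte elements,
# 128-byte cache lines, 4 KB static buffers.
cached, buffered = transfer_counts(65536, 4, 128, 4096)
print(cached, buffered)
print(serial_time(16, 3, 5), double_buffered_time(16, 3, 5))
```

With these assumed values the software cache performs 16384 lookups and 512 DMA operations, against no lookups and 16 DMA operations for static buffering. Overlapping transfers with computation reduces the modeled time for 16 chunks from 128 to 83 time units.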

|