IBM Research

Performance Evaluation


 


We present here a performance evaluation of our compiler technology for Cell.

Figure 1 shows the performance improvement obtained by optimized SPE code generation.

Figure 1. Performance improvement obtained by optimizing for the SPE.

Figure 1 presents the reduction in program execution time for each optimization, relative to the performance of the original compiler (standard optimizations at the O3 level, scheduled for the SPE resource and latency model). We report an average reduction of 22%, ranging from 11% to 51%. The first gain, in green, corresponds to the benefit of bundling instructions for dual issue. The second gain, in yellow, corresponds to the additional benefit of compiler-assisted branch hints. The third gain, in orange, corresponds to the additional benefit of compiler handling of instruction fetch. Its impact is most notable for benchmarks with high memory bandwidth demands, such as matrix multiply and saxpy. The last gain, in red, corresponds to the additional benefit of handling subword computation in the backend of the compiler.
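Figure 1 reports reductions in execution time, while Figures 2 and 3 report speedup factors. As a reading aid, the short C sketch below converts one into the other using the standard relation speedup = 1 / (1 - reduction). The function name is ours and purely illustrative; it is not part of the compiler.

```c
#include <assert.h>

/* Convert a fractional reduction in execution time (0.22 for 22%)
 * into the equivalent speedup factor: T_old / T_new = 1 / (1 - r).
 * The name speedup_from_reduction is a hypothetical helper. */
double speedup_from_reduction(double reduction)
{
    /* A reduction of 100% would mean zero run time, so exclude it. */
    assert(reduction >= 0.0 && reduction < 1.0);
    return 1.0 / (1.0 - reduction);
}
```

Under this relation, a 22% reduction corresponds to a speedup of roughly 1.28, and the reported extremes of 11% and 51% correspond to roughly 1.12 and 2.04.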

Performance improvement obtained by automatic SIMD code generation for the SPE is shown in Figure 2.

Figure 2. Performance improvement due to automatic SIMD code generation.

Figure 2 presents the speedup factors achieved when automatically simdizing sequential code kernels. Comparisons are performed at the same level of optimization, including high-level, interprocedural optimizations in addition to all of the SPE optimizations reported in Figure 1. We report an average speedup factor of 9.9, ranging from 2.4 to 26.2.
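To make "simdizing a sequential kernel" concrete, here is a minimal sketch in portable C using saxpy, one of the kernels named above. It only illustrates the shape of the transformation: the inner loop over `LANES` stands in for a single vector multiply-add on a 128-bit SPE register, and the function names and the no-remainder assumption are ours, not the compiler's actual output.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* four 32-bit floats fill one 128-bit SPE register */

/* Sequential kernel as a programmer might write it. */
void saxpy_scalar(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Shape of the simdized loop: each outer iteration handles LANES
 * elements. The inner loop stands in for one vector multiply-add. */
void saxpy_blocked(size_t n, float a, const float *x, float *y)
{
    /* Assumption for this sketch: no remainder loop is needed. */
    assert(n % LANES == 0);
    for (size_t i = 0; i < n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            y[i + l] = a * x[i + l] + y[i + l];
}
```

Both functions compute the same result; the blocked form simply exposes groups of four independent operations that a SIMD unit can execute together.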

Note that the speedup can be super-linear, that is, larger than the number of elements processed per SIMD instruction, because executing scalar code on the SPE may require a read-modify-write sequence in order to write a single byte of data. Successfully simdized code avoids this overhead.
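The read-modify-write overhead can be illustrated with a simplified model in portable C. It assumes a memory that only supports aligned 16-byte (quadword) loads and stores, as the SPE local store does; the helper names are hypothetical, and real SPE code would insert the byte with shuffle instructions rather than array indexing.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QUAD 16  /* the SPE loads and stores whole 16-byte quadwords */

/* Model of a memory that only supports aligned quadword access. */
static void load_quad(const uint8_t *mem, size_t addr, uint8_t out[QUAD])
{
    assert(addr % QUAD == 0);
    memcpy(out, mem + addr, QUAD);
}

static void store_quad(uint8_t *mem, size_t addr, const uint8_t in[QUAD])
{
    assert(addr % QUAD == 0);
    memcpy(mem + addr, in, QUAD);
}

/* Scalar store of one byte: read the enclosing quadword,
 * modify one byte in it, then write the quadword back. */
void store_byte_rmw(uint8_t *mem, size_t addr, uint8_t value)
{
    uint8_t q[QUAD];
    size_t base = addr & ~(size_t)(QUAD - 1);
    load_quad(mem, base, q);     /* read   */
    q[addr - base] = value;      /* modify */
    store_quad(mem, base, q);    /* write  */
}

/* Simdized store: all 16 byte results are produced together,
 * so a single quadword store suffices and no load is needed. */
void store_16_bytes(uint8_t *mem, size_t addr, const uint8_t values[QUAD])
{
    store_quad(mem, addr, values);
}
```

In this model, writing 16 bytes one at a time costs 16 loads and 16 stores, whereas the simdized version costs one store. This is one way the gain can exceed the 16 byte lanes of a register.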

Performance improvement obtained using automatic work distribution is shown in Figure 3.

Figure 3. Performance improvement due to automatic work distribution.

Figure 3 presents the speedup in execution time obtained when the compiler automatically distributes work among the PPE and 8 SPE cores. The baseline for the comparison is the execution time on the PPE alone, using code compiled at the highest level of optimization. The blue bars show speedups obtained when we only use the software cache mechanism to transfer all application data in and out of the SPE local stores. The red bars show speedups obtained when some of these transfers are optimized to use static buffering, instead of being cached on demand. For the two benchmarks Swim and Mgrid from the SPEC OpenMP 2001 suite, we report an average speedup factor of 7.7. The improved performance stems from eliminating software cache lookup overhead, transferring more data per DMA operation, and overlapping DMA transfers with SPE computation.
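The difference between the two data-transfer strategies can be sketched in portable C. This is an illustration under stated assumptions, not the compiler's runtime: `dma_get` is a hypothetical stand-in implemented with `memcpy` (so no real overlap occurs here), and the line size, cache size, and all function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE   32  /* elements per cache line or bulk transfer (hypothetical) */
#define NLINES 4   /* lines in the direct-mapped software cache (hypothetical) */

/* Stand-in for a DMA transfer from main memory into the local store. */
static void dma_get(float *local, const float *remote, size_t count)
{
    memcpy(local, remote, count * sizeof(float));
}

/* Software cache: every access pays for a tag check. */
typedef struct {
    float  data[NLINES][LINE];
    size_t tag[NLINES];
    int    valid[NLINES];
} sw_cache;

static float cached_read(sw_cache *c, const float *remote, size_t i)
{
    size_t line = i / LINE;
    size_t slot = line % NLINES;
    if (!c->valid[slot] || c->tag[slot] != line) {  /* miss: fetch one line */
        dma_get(c->data[slot], remote + line * LINE, LINE);
        c->tag[slot] = line;
        c->valid[slot] = 1;
    }
    return c->data[slot][i % LINE];
}

float sum_cached(const float *remote, size_t n)
{
    assert(n % LINE == 0);  /* keeps whole-line fetches in bounds */
    sw_cache c;
    memset(&c, 0, sizeof c);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += cached_read(&c, remote, i);  /* lookup on every element */
    return s;
}

/* Static buffering: bulk transfers into two alternating buffers,
 * with no per-element lookup in the compute loop. */
float sum_buffered(const float *remote, size_t n)
{
    assert(n % LINE == 0);
    float buf[2][LINE];
    size_t chunks = n / LINE;
    float s = 0.0f;
    if (chunks == 0)
        return s;
    dma_get(buf[0], remote, LINE);  /* fetch the first chunk */
    for (size_t k = 0; k < chunks; k++) {
        if (k + 1 < chunks)  /* on real hardware this transfer would be asynchronous */
            dma_get(buf[(k + 1) % 2], remote + (k + 1) * LINE, LINE);
        const float *cur = buf[k % 2];  /* compute on the current chunk */
        for (size_t j = 0; j < LINE; j++)
            s += cur[j];
    }
    return s;
}
```

The buffered version reflects the three sources of improvement named above: the inner loop has no lookup, each transfer moves a whole chunk, and on real hardware the fetch of chunk k+1 could proceed while chunk k is being summed.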



