A worldwide first: An IBM team quantifies the power- and performance-efficiency of parallelism using real-world workloads.
If semiconductor manufacturers are to thrive, they must embrace new performance techniques beyond traditional processor architectures, such as those based on instruction-level parallelism (ILP). While conventional manufacturers are still chasing gigahertz, IBM is conducting seminal research that will change the way systems are conceived across the semiconductor industry.
Building on systems expertise and integrating disciplines from semiconductor manufacturing to applications programming, IBM researchers took a multi-faceted approach -- based on Power Architecture -- to keep delivering performance growth to IBM customers. They exploited data parallelism at the inner-loop level, as well as outer-loop parallelism through multi-threading -- the key to building large systems. Moreover, to exploit coarse-grain parallelism across large, processing-intensive data sets, researchers began designing systems with many processors that can work together efficiently.
Parallelism is good not only for performance; it also improves power efficiency. To quantify the power and performance efficiency of different types of parallelism, IBM researchers performed power and performance measurements on a large-scale Blue Gene system (see Figure 1) for a range of important scientific applications, such as molecular dynamics and weather modeling. And for the first time, they also demonstrated quantitatively the power and performance efficiency of different forms of parallelism based on a broad study of key high-performance computing (HPC) workloads, including NAMD, SPHOT, UMT2K and WRF. Their conclusion: Being frugal with power in the new era of power-constrained design ultimately leads to higher performance.
Figure 1: Blue Gene/P System
Microprocessor power dissipation, and ultimately heat generation, is governed by the classic equation P = CV²f. Because raising the frequency f also requires raising the voltage V roughly in proportion, the traditional approach of scaling up voltage and frequency to increase performance leads to a cubic power increase: to double the performance, eight times as much power would have to be spent. In comparison, doubling the number of cores doubles the performance while only doubling the power dissipation.
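The cubic-versus-linear tradeoff can be sketched in a few lines of Python -- a minimal illustration of the P = CV²f model above, not a circuit simulation:

```python
def dynamic_power(c, v, f):
    """Dynamic power dissipation: P = C * V^2 * f."""
    return c * v**2 * f

base = dynamic_power(c=1.0, v=1.0, f=1.0)

# Frequency scaling: 2x performance via 2x frequency, which also
# requires roughly 2x voltage -- so power grows by 2^3 = 8x.
freq_scaled = dynamic_power(c=1.0, v=2.0, f=2.0)

# Parallel scaling: 2x performance via a second identical core --
# power merely doubles.
core_scaled = 2 * base

assert freq_scaled / base == 8.0
assert core_scaled / base == 2.0
```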
Figure 2 shows the power and performance benefits of parallelism for a real application on a Blue Gene system known as UMT2K.* As seen here, smaller is better for all of the metrics: execution time t, power P, energy E, and the power/performance-efficiency metrics E·t and E·t² (the energy-delay and energy-delay-squared products). Doubling the number of processors in a system doubles the power but cuts the execution time t in half. Hence, the energy E spent computing the result stays roughly constant, while the efficiency metrics improve dramatically because the work is performed faster.
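The relationship among these metrics can be made concrete with a short Python sketch; the numbers below are hypothetical round figures chosen to mirror the doubling argument, not measured UMT2K values:

```python
def metrics(power, time):
    """Compute the efficiency metrics for one run: E, E*t, E*t^2."""
    energy = power * time               # E = P * t
    return {"t": time, "P": power, "E": energy,
            "Et": energy * time,        # energy-delay product
            "Et2": energy * time**2}    # energy-delay-squared product

one_node  = metrics(power=1.0, time=100.0)
two_nodes = metrics(power=2.0, time=50.0)   # 2x power, half the time

# Energy stays constant, while E*t halves and E*t^2 drops by 4x.
assert two_nodes["E"] == one_node["E"]
assert two_nodes["Et"] == one_node["Et"] / 2
assert two_nodes["Et2"] == one_node["Et2"] / 4
```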
Data-level parallelism is even more efficient than thread-level parallelism. It is exploited through single-instruction, multiple-data (SIMD) instructions, where each instruction operates on several data elements simultaneously, reducing the number of instructions required. Like thread-level parallelism, it can cut the execution time almost in half, but for a power increase of only five percent, yielding lower energy consumption and even better energy efficiency.
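A rough way to see where the instruction savings come from is to model a 4-wide vector unit in plain Python; the slicing below stands in for a SIMD register and is only an illustration, not real vector hardware:

```python
def scalar_add(a, b):
    """Element-wise add, one instruction per element."""
    out, ops = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        ops += 1                    # one scalar add per element
    return out, ops

def simd_add(a, b, width=4):
    """Element-wise add, one vector instruction per `width` elements."""
    out, ops = [], 0
    for i in range(0, len(a), width):
        out.extend(x + y for x, y in zip(a[i:i+width], b[i:i+width]))
        ops += 1                    # one vector add covers 4 elements
    return out, ops

a, b = list(range(16)), list(range(16))
r_scalar, scalar_ops = scalar_add(a, b)   # 16 instructions
r_simd, vector_ops = simd_add(a, b)       # 4 instructions
assert r_scalar == r_simd
```

The same result is produced either way; the vector version simply issues a quarter of the instructions, which is where the energy advantage comes from.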
Combining parallelism at both the data and the thread level yields the best result of all -- with the highest performance and the best power/performance efficiency.
Figure 2: Impact of thread level and data-level parallelism
on UMT2K* power and performance
Finally, IBM researchers studied power and performance scaling behavior by using NAMD,** another popular application. Following the principle that if two cores are good, 20,000 cores must be even better, researchers found that increasing the number of cores in a massively parallel system, such as Blue Gene, not only continually improves performance, but also increases power and performance efficiency as the system size is scaled up. The combination of power efficiency and delivered performance made the Blue Gene system Number One on the Top500 list of the world's fastest supercomputers and Number One on the new Green500 list of the world's most energy-efficient supercomputers.
Figure 3: NAMD** scaling
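The scaling argument above can be sketched as follows; the `run` model and its parameters are hypothetical, and perfect parallel efficiency is an idealization rather than measured NAMD data:

```python
def run(cores, work=1.0e6, per_core_power=1.0, efficiency=1.0):
    """Return (time, power, energy) for a fixed amount of work
    strong-scaled across `cores` processors."""
    time = work / (cores * efficiency)   # time shrinks ~linearly
    power = cores * per_core_power       # power grows linearly
    return time, power, power * time     # energy = P * t

t1, p1, e1 = run(cores=1)
t8, p8, e8 = run(cores=8)

# At high efficiency: 8x faster, 8x power, the same energy per run,
# and an 8x better E*t metric -- scaling up pays for itself.
assert t1 / t8 == 8.0
assert e1 == e8
assert (e1 * t1) / (e8 * t8) == 8.0
```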
* UMT2K is a 3D, deterministic, multigroup photon transport program for unstructured meshes.
** NAMD is a molecular dynamics code that simulates protein folding.
See more about the author of this article:
Valentina Salapura, on behalf of the Blue Gene team.
V. Salapura, R. Walkup and A. Gara. Exploiting Workload Parallelism for Performance and Power Optimization. IEEE Micro 26(5), September 2006.
V. Salapura. Next Generation Supercomputers: Exploiting Innovative Massively Parallel System Architecture to Facilitate Breakthrough Scientific Discoveries. Grace Hopper Celebration of Women in Computing 2007 (GHC 2007), October 2007.
V. Salapura. Understanding Technology and Its Metrics for Future Systems. CRA-W Princeton Computer Architecture Summer School, July 2006.
V. Salapura. Power and Performance Optimization at the System Level. ACM International Conference on Computing Frontiers, May 2005.
V. Salapura. The Future of Supercomputing: Optimizing Performance with Power-Efficient System Design. International Symposium on High Performance Computing (ISHPC-VI), September 2005.
Q&A: Dr. Valentina Salapura. An overview of next-generation computing. HPC Wire.
Last updated November 15, 2007