Massive parallelism for power and performance efficiency


A worldwide first: An IBM team quantifies the power and performance efficiency of parallelism using real-world workloads.

If semiconductor manufacturers are to thrive, they must embrace new performance techniques rather than rely on traditional processor architectures, such as those based on instruction-level parallelism. While conventional manufacturers are still chasing gigahertz, IBM is conducting seminal research that will change the way systems are conceived across the semiconductor industry.

Building on systems expertise and integrating disciplines from semiconductor manufacturing to applications programming, IBM researchers took a multi-faceted approach -- based on Power Architecture -- to keep delivering performance growth to IBM customers. They exploited data parallelism at the inner-loop level, as well as parallelism at the outer-loop level by using multi-threading -- the key to building large systems. Moreover, to exploit coarse-grain parallelism in large, processing-intensive data sets, researchers started designing systems with many processors that can work together efficiently.

Parallelism is good not only for performance; it also improves power efficiency. To quantify the power and performance efficiency of different types of parallelism, IBM researchers performed power and performance measurements on a large-scale Blue Gene system (see Figure 1) for a range of important scientific applications, such as molecular dynamics and weather modeling. And for the first time, they also demonstrated quantitatively the power and performance efficiency of different forms of parallelism based on a broad study of key high-performance computing (HPC) workloads, including NAMD, SPHOT, UMT2K and WRF. Their conclusion: Being frugal with power in the new era of power-constrained design ultimately leads to higher performance.


Figure 1: Blue Gene/P System


Microprocessor power dissipation, and ultimately heat generation, is governed by the classic equation P = CV²f, where C is the switched capacitance, V the supply voltage and f the clock frequency. Running at a higher frequency generally requires a proportionally higher voltage, so the traditional approach of increasing V and f together to increase performance leads to a cubic power increase. To double the performance, eight times as much power would have to be spent. In comparison, doubling the number of cores ideally also doubles the performance, but only doubles power dissipation.
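The scaling argument above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the researchers' model: it assumes that supply voltage rises in direct proportion to frequency and that adding cores gives ideal speedup. The function names are hypothetical.

```python
def relative_power_frequency_scaling(speedup: float) -> float:
    """Relative power when performance is raised by raising frequency.

    Uses P = C * V^2 * f and assumes V scales in proportion to f
    (an idealization), so power grows as the cube of the speedup.
    """
    voltage_scale = speedup    # assumption: V tracks f linearly
    frequency_scale = speedup
    return voltage_scale ** 2 * frequency_scale


def relative_power_core_scaling(speedup: float) -> float:
    """Relative power when performance is raised by adding cores.

    Assumes ideal parallel speedup: n times the cores deliver n times
    the performance, with each core at unchanged V and f.
    """
    return speedup


for s in (1.0, 2.0, 4.0):
    print(s, relative_power_frequency_scaling(s), relative_power_core_scaling(s))
```

Under these assumptions, a twofold speedup costs eight times the power through frequency scaling but only twice the power through core scaling.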

Figure 2 shows the power and performance benefits of parallelism for UMT2K,* a real application running on a Blue Gene system. Smaller is better for all of the metrics shown: execution time t, power, energy E, and the power/performance efficiency metrics E·t and E·t². Doubling the number of processors in a system doubles the power, but cuts the execution time t in half. Hence, the energy E spent computing the result stays roughly constant. The power/performance efficiency improves dramatically because the work is performed faster.
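To see why constant energy still means better efficiency, the metrics can be tabulated for an idealized machine. This sketch assumes perfect scaling (time falls as 1/n, power rises as n), which real workloads only approximate. The values are illustrative, not measurements from the study, and the function name is hypothetical.

```python
def scaled_metrics(processor_scale: float) -> dict:
    """Relative t, P, E, E*t and E*t^2 after scaling the processor count.

    The baseline is t = 1 and P = 1. Assumes ideal scaling (an
    idealization, not measured data): time falls as 1/n, power rises as n.
    """
    t = 1.0 / processor_scale
    p = processor_scale
    e = p * t  # energy = power * time
    return {"t": t, "P": p, "E": e, "E*t": e * t, "E*t^2": e * t * t}


for n in (1, 2, 4, 8):
    print(n, scaled_metrics(n))
```

In this idealized case, E stays at 1 for every n, while E·t falls as 1/n and E·t² falls as 1/n².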

Data-level parallelism is even more efficient than thread-level parallelism. Data-level parallelism is exploited by using single instruction, multiple data (SIMD) instructions, where each instruction operates on several data elements simultaneously, thus reducing the number of instructions required. It can also cut the execution time almost in half, but for a power increase of only five percent, leading to lower energy consumption than thread-level parallelism and even better energy efficiency.
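The two forms of parallelism can be compared using the round numbers quoted in the text. This is a rough sketch: the data-level time is rounded from "almost in half" to exactly one half, an assumption made purely for illustration, and the helper name is hypothetical.

```python
def energy_metrics(rel_time: float, rel_power: float) -> tuple:
    """Return (E, E*t, E*t^2) relative to a baseline with t = 1 and P = 1."""
    energy = rel_power * rel_time
    return energy, energy * rel_time, energy * rel_time ** 2


# Thread-level: twice the processors, twice the power, half the time.
thread_level = energy_metrics(rel_time=0.5, rel_power=2.0)

# Data-level (SIMD): about half the time for roughly 5% more power.
# The 0.5 rounds "almost in half" and is an assumption for illustration.
data_level = energy_metrics(rel_time=0.5, rel_power=1.05)

print("thread-level:", thread_level)
print("data-level:  ", data_level)
```

With these inputs, the data-level case finishes in the same time as the thread-level case while spending roughly half the energy.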

Combining parallelism at both the data and the thread level yields the best result of all -- with the highest performance and the best power/performance efficiency.


Figure 2: Impact of thread-level and data-level parallelism on UMT2K* power and performance

Finally, IBM researchers studied power and performance scaling behavior by using NAMD,** another popular application. Following the principle that if two cores are good, 20,000 cores must be even better, researchers found that increasing the number of cores in a massively parallel system, such as Blue Gene, not only continually improves performance, but also improves power and performance efficiency as the system size is scaled up. This combination of power efficiency and delivered performance made the Blue Gene system number one on the Top500 list of the world’s fastest supercomputers and number one on the new Green500 list of the world’s most energy-efficient supercomputers.


Figure 3: NAMD** scaling

* UMT2K is a 3D, deterministic, multigroup photon transport program for unstructured meshes.
** NAMD is a molecular dynamics code that simulates protein folding.

Author: Valentina Salapura, on behalf of the Blue Gene team.


Related Publications  

V. Salapura, R. Walkup and A. Gara. Exploiting Workload Parallelism for Performance and Power Optimization. IEEE Micro 26(5), September 2006.

V. Salapura. Next Generation Supercomputers: Exploiting Innovative Massively Parallel System Architecture to Facilitate Breakthrough Scientific Discoveries. Grace Hopper Celebration of Women in Computing 2007 (GHC 2007). October 2007.

V. Salapura. Understanding Technology and Its Metrics for Future Systems. CRA-W Princeton Computer Architecture Summer School. July 2006.

V. Salapura. Power and Performance Optimization at the System Level. ACM International Conference on Computing Frontiers. May 2005.

V. Salapura. The Future of Supercomputing: Optimizing Performance with Power-Efficient System Design. International Symposium on High Performance Computing ISHPC-VI. September 2005.


Highlights

Q&A: Dr. Valentina Salapura. An overview of next-generation computing. HPC Wire.

Last updated November 15, 2007

Innovator's corner  

Valentina Salapura, Researcher

What's the potential for the work you're doing?
Supercomputers are the prime enablers of computational science. Computational science lets us explore and understand our world in ways that are not open to the traditional natural sciences. By simulating systems, we can analyze and learn to understand the behavior of natural processes. This gives us insights into areas previously not accessible to the traditional sciences, such as physical and chemical processes at the nano-scale, the creation of matter and the structure of the universe, and the causes of diseases and illnesses.

What is the most interesting part of your research?
Computer architecture is going through tectonic shifts right now, and the way we think of computers will change drastically over the next few years. Only a few years ago, building high-performance systems meant not caring about power. Today, building a high-performance system means worrying about power first. Other challenges are also on the horizon with new technology generations: small feature sizes and low-voltage operation make designing reliable systems more exciting and more challenging at the same time.

As computing systems are everywhere, we have a mandate to make them use less power. I am thrilled that the system we built is the most power-efficient supercomputer as reflected in its top rank in the “Green500,” an important new metric to energize the information technology community to optimize computers to be energy efficient.

Who or what inspired you to go into this field?
My father was a physicist, so I was surrounded by science from a very young age. I was always interested in how science can improve people’s lives, and information technology appeared to be a key discipline. With the technologies we are developing today to benefit medical research and model environmental processes, we are delivering on this promise.

What is your favorite invention of all time?
The printing press has had a profound influence on the way knowledge and education can be spread, making knowledge more accessible and democratic.

Research team  

Gerard Kopcsay

Bob Walkup