BlueGene/L: Benchmarking and Applications Performance



This project looks at a broad set of high performance computing applications, including life sciences, financial modeling, hydrodynamics, quantum chemistry, molecular dynamics, astronomy, space research and climate modeling. For example, the Car-Parrinello Molecular Dynamics (CPMD) application on BG/L will increase by an order of magnitude the time window that we can explore using accurate ab-initio MD - this is especially important for bio-related applications (e.g., for a timely and accurate study of complex enzymatic reactions or molecule/protein interactions). It will also increase by an order of magnitude the size (number of atoms) of the system we can simulate using ab-initio methods - this is especially important in materials science (e.g., in the simulation of complex interfaces such as the silicon/high-k oxide interface).


The Blue Matter code will allow researchers to study protein folding, protein dynamics and the interactions of protein complexes and aggregates. Proteins are essential to life, and their malfunction is responsible for many diseases, including cystic fibrosis, mad cow disease and Alzheimer's. Scaling the high-order multiscale modeling environment (HOMME) code from the National Center for Atmospheric Research (NCAR) would allow direct numerical simulation of cloud processes on a global scale, improving the accuracy of climate models. The quantum chromodynamics (QCD) computation being tuned on BG/L is of basic importance to cosmology and early universe physics. The locally self-consistent multiple scattering (LSMS) code allows the study of electronic structures in nanoparticles, with applications to, for example, the design of hard disk drives. The FLASH code is a multi-physics simulation code designed to solve nuclear astrophysical problems related to exploding stars. The simulation performed on BG/L is the deflagration phase of a Type Ia supernova, beginning with the deflagration initiated near the center of the white dwarf star.

[Figure: IBM's Blue Gene/L supercomputer]

To fully realize the potential of BG/L technology, these codes need to be scaled to unprecedented levels of parallelism: to over a hundred thousand processors for the large BG/L system being delivered to Lawrence Livermore National Laboratory. To begin with, researchers need to make sure they get good single node performance, because that provides the baseline for improvements through exploitation of multi-node parallelism. Successful scaling requires careful attention to stamping out serial components from the code so that Amdahl's Law does not kill the project. Researchers also need creative approaches to reduce the overhead of passing messages between nodes to ensure efficient parallel execution.
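The point about Amdahl's Law can be made concrete with a small calculation. The sketch below is illustrative only: the function name, the 0.1% serial fraction, and the processor counts are assumptions chosen for the example, not measurements from any of the BG/L codes. It evaluates the standard formula, speedup = 1 / (s + (1 - s) / P), where s is the fraction of the run time that stays serial and P is the number of processors.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Ideal speedup predicted by Amdahl's Law.

    serial_fraction: share of the single-processor run time that cannot be
                     parallelized (between 0.0 and 1.0).
    processors:      number of processors used (at least 1).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Hypothetical code in which only 0.1% of the work is serial.
    for p in (1024, 16384, 131072):
        print(f"{p:>7} processors: speedup {amdahl_speedup(0.001, p):.1f}")
```

However many processors are added, the speedup in this model can never exceed 1 / s. With a serial fraction of 0.1% the ceiling is 1000, which is why even small serial components have to be removed before a code can make use of a hundred thousand processors.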

BG/L offers a rich platform with challenges for each of these goals. It provides a two-way Single Instruction, Multiple Data (SIMD) floating point unit on each central processing unit (CPU), where successful exploitation requires exposing fine-grained parallelism and also dealing with data alignment issues. As with any scalar processor with a memory hierarchy, it is important to enhance the data locality of the code to achieve good single node performance. BG/L provides two processor cores in a compute node which are not coherent with each other and hence cannot be exploited using conventional multithreading techniques. It supports three different interprocessor networks: a three-dimensional torus, a global combining tree, and a global barrier and interrupt tree network. These have implications for algorithm design, given the ability to do certain global communication operations very efficiently. It is important to map the virtual processor topology of the application to the physical 3D torus topology in order to use the torus links efficiently.
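To illustrate why the mapping onto the torus matters, the following sketch compares two placements of a logical ring of processes on a 3D torus. Everything here is an assumption made for the example: the 8 x 8 x 8 partition size, the helper names, and the choice of a ring as the application's virtual topology. It is not the BG/L system software's mapping mechanism, only a way to count how many torus links a neighbor-to-neighbor message would cross.

```python
import random


def rank_to_coords(rank, dims):
    """Place a linear rank on an X*Y*Z torus, with x varying fastest."""
    x_dim, y_dim, _ = dims
    return (rank % x_dim, (rank // x_dim) % y_dim, rank // (x_dim * y_dim))


def torus_hops(a, b, dims):
    """Minimum number of link hops between nodes a and b.

    Each dimension wraps around, so the distance along a dimension of
    size d is the shorter of going forward or going backward.
    """
    return sum(min(abs(p - q), d - abs(p - q)) for p, q, d in zip(a, b, dims))


def mean_ring_hops(placement, dims):
    """Average hops between logical ring neighbors r and r+1 (mod n)."""
    n = len(placement)
    total = sum(
        torus_hops(placement[r], placement[(r + 1) % n], dims) for r in range(n)
    )
    return total / n


DIMS = (8, 8, 8)  # hypothetical partition size, chosen for illustration

# Placement 1: consecutive ranks sit on adjacent (or nearly adjacent) nodes.
ordered = [rank_to_coords(r, DIMS) for r in range(DIMS[0] * DIMS[1] * DIMS[2])]

# Placement 2: the same nodes, assigned to ranks in a scrambled order.
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)

if __name__ == "__main__":
    print("ordered placement :", mean_ring_hops(ordered, DIMS))
    print("shuffled placement:", mean_ring_hops(shuffled, DIMS))
```

In the ordered placement most ring neighbors are a single link apart, while the scrambled placement sends each message across several links and so competes for torus bandwidth with other traffic. Real applications usually have 2D or 3D virtual topologies, but the same hop-counting idea applies.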

Most of the above codes have been successfully scaled to previously unattained levels of eight thousand to sixteen thousand processors. The system design, with an architecture that balances computation and communication capabilities, and the system software design, with a hierarchical organization that provides an "interference free" operating system environment on the compute nodes and a highly optimized message passing library, clearly played a key role in the rapid scaling of these applications. There are also several innovations at the algorithmic level that led to impressive performance results. The team developed tuned implementations of certain math intrinsic functions, such as exponential and logarithm, to take advantage of the SIMD floating point unit as well as the pipelined processor architecture, often obtaining performance improvements of 5 to 15 times over the default implementations of those functions. On some codes, the team had to identify performance bottlenecks using traces of message passing calls, and reorganize communication calls in the code to improve performance. On codes that were not designed with such high levels of scalability in mind, serial components, such as a graph partitioning algorithm, were identified and parallelized.

The Linpack benchmarking effort involved addressing all of the system architectural aspects. A system model was used to highlight unexpected performance characteristics and to pinpoint their source, whether systemic, localized, or the result of incorrect assumptions embodied in the mathematical software. As with applications, the foundation for good Linpack system performance was good single processor performance. To achieve this, we invented new techniques and improved upon known techniques for dealing with core characteristics, such as penalties for unaligned data and the non-LRU (least recently used) nature of the L1 cache, and for taking advantage of the available features of the nodes (for example, hardware prefetching and the large L3 cache). On the system scale, we used work-management algorithms that efficiently exploit dynamic and predictable work distribution patterns in order to move data along the critical path as quickly as possible, and we investigated data distribution techniques that enhance load balance in a small-memory system, with affinity for the physical nature (dimensionality) of the underlying machine.
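As an illustration of how data distribution affects load balance in a Linpack-style factorization, the sketch below compares a plain blocked layout with a two-dimensional block-cyclic layout, a standard choice in distributed dense linear algebra. This is an assumption made for the example: the text does not say which distributions the team investigated, and the matrix size, block size, process grid, and function names are all hypothetical. The code counts how many elements of the trailing submatrix (the part still being updated once the factorization has reached column k) each process owns.

```python
from collections import Counter


def block_cyclic_owner(i, j, block, grid):
    """Owner (process row, process column) of element (i, j) when
    block x block tiles are dealt out cyclically over the process grid."""
    p_rows, p_cols = grid
    return ((i // block) % p_rows, (j // block) % p_cols)


def blocked_owner(i, j, n, grid):
    """Owner of element (i, j) when each process holds one contiguous
    chunk of an n x n matrix."""
    p_rows, p_cols = grid
    rows_per = -(-n // p_rows)  # ceiling division
    cols_per = -(-n // p_cols)
    return (i // rows_per, j // cols_per)


def trailing_load(n, k, owner):
    """Count trailing-submatrix elements (row >= k and column >= k)
    owned by each process, for a given owner(i, j) function."""
    load = Counter()
    for i in range(k, n):
        for j in range(k, n):
            load[owner(i, j)] += 1
    return load


if __name__ == "__main__":
    n, k, block, grid = 16, 8, 2, (2, 2)  # hypothetical sizes
    cyclic = trailing_load(n, k, lambda i, j: block_cyclic_owner(i, j, block, grid))
    blocked = trailing_load(n, k, lambda i, j: blocked_owner(i, j, n, grid))
    print("block-cyclic:", dict(cyclic))
    print("blocked     :", dict(blocked))
```

Halfway through the factorization, the blocked layout leaves all remaining work on a single process, while the block-cyclic layout keeps every process equally busy. Smaller blocks improve balance at the cost of more communication, which is one of the trade-offs that has to be tuned for a given machine.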

The Blue Gene project started in December 1999. The first chips became available in June 2003. The team demonstrated 1.4 teraflop/s sustained performance on a 512 node system in October 2003, less than four months after the availability of the first set of chips. The team working on the LINPACK code helped place two BG/L systems in the top 10 positions on the June 2004 TOP500 list of supercomputers. They helped capture the number one spot on the November 2004 TOP500 list, displacing the NEC Earth Simulator as the fastest computer in the world. The team working on applications helped showcase the broad applicability of the BG/L system to advancing science by demonstrating the successful scaling of several scientific codes on the BG/L system.


Innovator's corner  

John Gunnels, Researcher
What is the most exciting potential future use for the work you're doing?
A lot of my work results in mathematical routines used at the bottom of the pyramid by a number of interesting applications. The Blue Matter team works on studying reactions that may yield a better understanding of such things as Alzheimer's disease, and it would be very rewarding if my routines were to be used in that kind of effort.

What is the most interesting part of your research?
Whenever two disciplines can be used in concert, I find it very interesting. Currently, the work I am doing involves compilers, linear algebra and some very primitive artificial intelligence.


What inspired you to go into this field?
Probably my high school teacher, Mrs. Logan. My friend Joe Young and I spent far too many hours in high school figuring out how computer games were written for the Apple II, and Mrs. Logan gave us only minimal flak about wasting our time while attempting to push us in the right direction.


What is your favorite invention of all time?
Probably the automobile. If one can temporarily suspend concern about pollution, it is one of the few things that I use every day and could not get along without. The harnessing of a controlled explosion, its impact on how society is structured, etc. I wouldn't call it "beautiful" (like the radiometer), but it is, for me, indispensable.

Research team  

Gheorghe Almasi
Gyan Bhanot
Alessandro Curioni
Robert Germain
John Gunnels
Manish Gupta
Jim Sexton
Bob Walkup

Related Research  

Disciplines: Computer Science
Research Areas: Supercomputing
Research Labs: Watson Research Center