This project looks at a broad set of high performance computing applications, including life sciences, financial modeling, hydrodynamics, quantum chemistry, molecular dynamics, astronomy, space research and climate modeling. For example, the Car-Parrinello Molecular Dynamics (CPMD) application on BG/L will increase by an order of magnitude the time window that we can explore using accurate ab-initio MD. This is especially important for bio-related applications (e.g., for a timely and accurate study of complex enzymatic reactions or molecule/protein interactions). It will also increase by an order of magnitude the size (number of atoms) of the system we can simulate using ab-initio methods. This is especially important in materials science (e.g., in the simulation of complex interfaces such as the silicon/high-k oxide interface).
The Blue Matter code will allow researchers to study protein folding, protein dynamics and the interactions of protein complexes and aggregates. Proteins are essential to life, and their malfunction is responsible for many diseases, including cystic fibrosis, mad cow disease and Alzheimer's disease. Scaling the High-Order Method Modeling Environment (HOMME) code from the National Center for Atmospheric Research (NCAR) would allow direct numerical simulation of cloud processes on a global scale, improving the accuracy of climate models. The quantum chromodynamics (QCD) computation being tuned on BG/L is of basic importance to cosmology and early universe physics. The locally self-consistent multiple scattering (LSMS) code allows the study of electronic structures in nanoparticles, with applications to, for example, the design of hard disk drives. The FLASH code is a multi-physics simulation code designed to solve nuclear astrophysical problems related to exploding stars. The simulation performed on BG/L models the deflagration phase of a Type Ia supernova, with the deflagration initiated near the center of the white dwarf star.
IBM's Blue Gene/L supercomputer
To fully realize the potential of BG/L technology, these codes need to be scaled to unprecedented levels of parallelism: to over a hundred thousand processors for the large BG/L system being delivered to Lawrence Livermore National Laboratory. To begin with, researchers need to make sure they get good single node performance, because that provides the baseline for improvements through exploitation of multi-node parallelism. Successful scaling requires careful attention to stamping out serial components from the code, so that Amdahl's Law does not cap the achievable speedup. Researchers also need creative approaches to reduce the overhead of passing messages between nodes to ensure efficient parallel execution.
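To make the Amdahl's Law concern concrete: if a fraction s of the work is serial, the speedup on N processors is bounded by 1 / (s + (1 - s) / N), which can never exceed 1 / s. The sketch below evaluates that bound. The serial fractions and processor counts are illustrative assumptions, not measurements from any of the codes described here.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Upper bound on speedup when a fixed fraction of the work is serial."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be >= 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Illustrative values only: the serial fractions are assumptions.
    for s in (0.01, 0.001, 0.0001):
        for n in (1_000, 16_000, 100_000):
            bound = amdahl_speedup(s, n)
            print(f"serial={s:<7} processors={n:<7} speedup bound={bound:,.0f}")
```

Under these assumptions, a code that is only 0.1% serial cannot run more than about a thousand times faster no matter how many processors are added, which is why even small serial sections matter at this scale.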
BG/L offers a rich platform with challenges for each of these goals. It provides a two-way Single Instruction, Multiple Data (SIMD) floating point unit on each central processing unit (CPU), where successful exploitation requires exposing fine-grained parallelism and also dealing with data alignment issues. As with any scalar processor with a memory hierarchy, it is important to enhance the data locality of the code to achieve good single node performance. BG/L provides two processor cores in each compute node that are not cache coherent with each other and hence cannot be exploited using conventional multithreading techniques. It supports three different interprocessor networks: a three-dimensional torus, a global combining tree, and a global barrier and interrupt tree network. These have implications for algorithm design, given the ability to do certain global communication operations very efficiently. It is important to map the virtual processor topology of the application to the physical 3D torus topology in order to use the torus links efficiently.
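The effect of topology mapping can be illustrated with a small model. The sketch below assumes an application whose processes form a periodic 3D grid of the same shape as the torus and exchange data with their +1 neighbor along each axis. It compares a placement that matches the torus with a random placement by counting link hops per exchange. The function names, the 8 x 8 x 8 shape, and the random placement are all hypothetical choices for illustration; this is not the BG/L system software's mapping interface.

```python
import random
from itertools import product


def torus_hops(a, b, dims):
    """Minimum link hops between two nodes, using wraparound in each dimension."""
    return sum(min(abs(p - q), d - abs(p - q)) for p, q, d in zip(a, b, dims))


def identity_placement(dims):
    """Logical process (i, j, k) runs on physical node (i, j, k)."""
    return {c: c for c in product(*(range(d) for d in dims))}


def random_placement(dims, seed=0):
    """Logical processes scattered over the machine with no regard to topology."""
    logical = list(product(*(range(d) for d in dims)))
    physical = logical[:]
    random.Random(seed).shuffle(physical)
    return dict(zip(logical, physical))


def average_neighbor_hops(dims, placement):
    """Mean hops over all +1 neighbor exchanges of a periodic 3D logical grid."""
    total = 0
    count = 0
    for coord in product(*(range(d) for d in dims)):
        for axis in range(3):
            neighbor = list(coord)
            neighbor[axis] = (coord[axis] + 1) % dims[axis]
            total += torus_hops(placement[coord], placement[tuple(neighbor)], dims)
            count += 1
    return total / count


if __name__ == "__main__":
    dims = (8, 8, 8)  # hypothetical 512-node torus
    print("matched:", average_neighbor_hops(dims, identity_placement(dims)))
    print("random: ", average_neighbor_hops(dims, random_placement(dims)))
```

With the matched placement every exchange uses a single link, while the random placement sends each message across several links and so competes with other traffic for torus bandwidth.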
Most of the above codes have been successfully scaled to previously unattained levels of 8,000 to 16,000 processors. Two aspects of the design clearly played a key role in the rapid scaling of these applications: the system design, with an architecture that balances computation and communication capabilities, and the system software design, with a hierarchical organization that provides an "interference free" operating system environment on the compute nodes and a highly optimized message passing library. There are also several innovations at the algorithmic level that led to impressive performance results. The team developed tuned implementations of certain math intrinsic functions, such as exponential and logarithm, to take advantage of the SIMD floating point unit as well as the pipelined processor architecture, often obtaining a 5 to 15 times performance improvement over the default implementations of those functions. On some codes, the team had to identify performance bottlenecks using traces of message passing calls, and reorganize communication calls in the code to improve performance. On codes that were not designed with such high levels of scalability in mind, serial components, such as a graph partitioning algorithm, were identified and parallelized.
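As a rough illustration of what a SIMD-friendly exponential can look like, the sketch below uses a common textbook approach: reduce the argument to a small range, evaluate a fixed polynomial with Horner's rule, and rescale by a power of two. It processes two inputs in lockstep with no data-dependent branches, which is the structure a two-way SIMD unit and a pipelined processor favor. This is an assumption about the general technique, not the team's implementation; the name `exp_pair` and the degree-12 Taylor coefficients are hypothetical, and Python itself will not issue SIMD instructions.

```python
import math

LN2 = math.log(2.0)
# Taylor coefficients 1/k! for k = 0..12 (a tuned version would likely use
# minimax coefficients instead; Taylor is used here for simplicity).
COEFFS = [1.0 / math.factorial(k) for k in range(13)]


def exp_pair(x0: float, x1: float):
    """Approximate (exp(x0), exp(x1)) using the same instruction stream for both."""
    # Range reduction: x = k * ln(2) + r with |r| <= ln(2) / 2.
    k0, k1 = round(x0 / LN2), round(x1 / LN2)
    r0, r1 = x0 - k0 * LN2, x1 - k1 * LN2
    # Horner evaluation: each step is one multiply-add per lane, no branches.
    acc0 = acc1 = COEFFS[-1]
    for c in reversed(COEFFS[:-1]):
        acc0 = acc0 * r0 + c
        acc1 = acc1 * r1 + c
    # Reconstruction: exp(x) = 2**k * exp(r).
    return math.ldexp(acc0, k0), math.ldexp(acc1, k1)


if __name__ == "__main__":
    a, b = exp_pair(1.0, -2.5)
    print(a, math.exp(1.0))
    print(b, math.exp(-2.5))
```

The sketch omits overflow, underflow and NaN handling, which a production routine would need.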
The Linpack benchmarking effort involved addressing all of the system's architectural aspects. A system model was used to highlight unexpected performance characteristics and to pinpoint their source, whether that source was systemic, localized, or an incorrect assumption embodied in the mathematical software. As with applications, the foundation for good Linpack system performance was good single processor performance. To achieve this, we invented new techniques and improved upon known ones for dealing with core characteristics, such as the penalties for accessing unaligned data and the non-least recently used (non-LRU) nature of the L1 cache, and for taking advantage of the available features of the nodes (for example, hardware prefetching and the large L3 cache). On the system scale, we used work-management algorithms that efficiently exploit dynamic and predictable work distribution patterns in order to move data along the critical path as quickly as possible, and we investigated data distribution techniques that enhance load balance in a small-memory system, with affinity for the physical nature (dimensionality) of the underlying machine.
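The text does not say which data distributions were investigated. As background, the sketch below shows the two-dimensional block-cyclic distribution that is standard in High-Performance Linpack, and compares it with a simple contiguous distribution. During LU factorization the active (trailing) submatrix shrinks, so the sketch counts how many trailing blocks each process owns at a given step. The grid sizes and function names are hypothetical.

```python
from collections import Counter


def cyclic_owner(i, j, p_rows, p_cols):
    """Block (i, j) is dealt out round-robin over a p_rows x p_cols process grid."""
    return (i % p_rows, j % p_cols)


def contiguous_owner(i, j, n_blocks, p_rows, p_cols):
    """Each process owns one contiguous rectangle of blocks."""
    rows_per = -(-n_blocks // p_rows)  # ceiling division
    cols_per = -(-n_blocks // p_cols)
    return (i // rows_per, j // cols_per)


def trailing_load(n_blocks, step, p_rows, p_cols, owner):
    """Blocks owned by each process in the trailing submatrix [step:, step:]."""
    counts = Counter({(p, q): 0 for p in range(p_rows) for q in range(p_cols)})
    for i in range(step, n_blocks):
        for j in range(step, n_blocks):
            counts[owner(i, j)] += 1
    return counts


def imbalance(counts):
    """Ratio of the busiest process's load to the mean load (1.0 is perfect)."""
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean


if __name__ == "__main__":
    n_blocks, p_rows, p_cols, step = 16, 4, 4, 8  # illustrative sizes
    cyc = trailing_load(n_blocks, step, p_rows, p_cols,
                        lambda i, j: cyclic_owner(i, j, p_rows, p_cols))
    con = trailing_load(n_blocks, step, p_rows, p_cols,
                        lambda i, j: contiguous_owner(i, j, n_blocks, p_rows, p_cols))
    print("block-cyclic imbalance:", imbalance(cyc))
    print("contiguous imbalance:  ", imbalance(con))
```

Halfway through the factorization in this example, the contiguous layout leaves most processes with no trailing blocks, while the block-cyclic layout keeps every process equally loaded.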
The Blue Gene project started in December 1999. The first chips became available in June 2003. The team demonstrated 1.4 teraflop/s sustained performance on a 512-node system in October 2003, less than four months after the availability of the first set of chips. The team working on the Linpack code helped place two BG/L systems in the top 10 positions on the June 2004 TOP500 list of supercomputers. They helped capture the number one spot on the November 2004 TOP500 list, displacing the NEC Earth Simulator as the fastest computer in the world. The team working on applications helped showcase the broad applicability of the BG/L system to advancing science by demonstrating the successful scaling of several scientific codes on the BG/L system.
John Gunnels Researcher 






