When building multicore systems, computer architects must increase performance while exploiting the data-sharing benefits of parallelism. In designing the basic building block for the BlueGene/P supercomputer -- a four-way multicore node -- our architects turned to a filtering technique that recognizes and then eliminates coherence requests that incur cost without doing useful work.
As multicore systems scale to higher core counts, the cost of the coherence traffic exchanged between cores grows more severe. In a snooping-based coherence system, each core must inspect the memory traffic of every other core. In BlueGene/P’s write-invalidate protocol, for example, every write request must be snooped.
Hence, as the number of cores increases, the backlog of coherence requests waiting to be processed grows with it. Indeed, because each of the n cores in a multicore system must process the memory requests of all (n-1) other cores, the number of coherence actions scales as O(n²). This matters for overall processor performance because coherence requests interfere with a core’s access to its own cache.
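The quadratic growth can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name and the assumption that every write is broadcast to every other core (with no filtering) are ours, not taken from the BlueGene/P design documents.

```python
def snoop_lookups(cores: int, writes_per_core: int = 1) -> int:
    """Count snoop lookups when each write is seen by every other core.

    Assumption: an unfiltered write-invalidate broadcast, so each of the
    `cores` writers triggers a lookup in the other (cores - 1) caches.
    """
    return cores * (cores - 1) * writes_per_core


for n in (2, 4, 8, 16):
    print(n, snoop_lookups(n))
# 2 2
# 4 12
# 8 56
# 16 240
```

Doubling the core count roughly quadruples the number of lookups, which is why unfiltered snooping becomes expensive so quickly.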
One way to reduce the impact of coherence requests is to use dual-port cache directories that let snoop requests and cache accesses proceed in parallel. This solution, however, would come at the cost of a significant power and area increase (compared to a single-ported cache directory). Even so, system performance might be affected adversely as n cores must queue to use a single snoop directory port shared by all of the caches.
To increase performance while reducing power dissipation, the designers of the BlueGene/P multicore compute node turned to filtering out unnecessary coherence requests. In this scenario, each processor is shielded from unnecessary snoop requests by a novel snoop filter unit first introduced in the BlueGene/P system. The new snoop filter unit contains multiple filter engines, each adapted to a particular memory access pattern. Stream registers track a series of accesses by observing a processor’s own memory read requests. Snoop caches remember recently received coherence requests that have already been processed and, therefore, do not need to be repeated. Finally, a software-configurable range filter lets software exclude private data areas from snoop processing.
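To make the three filter engines more concrete, here is a minimal software sketch of each. It is a simplified model, not the hardware design: the class and method names, the 32-bit address width, the register and entry counts, and the base/mask encoding of a stream register are all assumptions made for illustration. Each engine answers one question: can this snoop be discarded because the line is provably not in the local cache?

```python
from collections import OrderedDict

ADDR_MASK = (1 << 32) - 1  # assumed 32-bit line addresses


class RangeFilter:
    """Software-configured address range whose snoops can be skipped."""

    def __init__(self, low: int, high: int) -> None:
        self.low, self.high = low, high  # inclusive bounds (hypothetical)

    def can_discard(self, line: int) -> bool:
        return self.low <= line <= self.high


class SnoopCache:
    """Remembers lines that an earlier snoop has already handled."""

    def __init__(self, capacity: int = 4) -> None:
        self.capacity = capacity
        self.lines = OrderedDict()

    def can_discard(self, line: int) -> bool:
        return line in self.lines

    def record_snoop(self, line: int) -> None:
        self.lines[line] = None
        self.lines.move_to_end(line)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the oldest entry

    def on_local_load(self, line: int) -> None:
        self.lines.pop(line, None)  # the line may now be cached again


class StreamRegisters:
    """Base/mask pairs covering a superset of the lines this core loaded."""

    def __init__(self, count: int = 2) -> None:
        self.count = count
        self.regs = []  # each entry is [base, mask]

    def can_discard(self, line: int) -> bool:
        # A register "matches" when the line agrees with base on all mask bits.
        return all((line ^ base) & mask for base, mask in self.regs)

    def on_local_load(self, line: int) -> None:
        if not self.can_discard(line):
            return  # already covered by an existing register
        if len(self.regs) < self.count:
            self.regs.append([line, ADDR_MASK])
            return
        # Widen the register that loses the fewest mask bits.
        best = min(self.regs, key=lambda r: bin((r[0] ^ line) & r[1]).count("1"))
        best[1] &= ~(best[0] ^ line) & ADDR_MASK


streams = StreamRegisters(count=1)
streams.on_local_load(0x1000)
streams.on_local_load(0x1040)
print(streams.can_discard(0x1040))  # False: the line may be cached
print(streams.can_discard(0x8000))  # True: no register covers this line
```

The sketch never shrinks a stream register, so it only becomes less selective over time; a real design needs some way to reset the registers once the tracked lines have left the cache, which is omitted here.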
Where our research is going
Because of global financial and energy concerns, we have a mandate to make computers use less power. We are excited that the system we built became the most power-efficient supercomputer at its product launch, as reflected in its top rank in the “Green500” -- an important new metric that will energize the information technology community to optimize energy efficiency in computers.
Each snoop unit contains one port filter per coherence traffic participant. The port filter consists of multiple snoop filters that work together to maximize snoop efficiency.
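One way to picture this organization is sketched below. The names (`PortFilter`, `SnoopUnit`, `must_forward`), the participant labels, and the rule that a snoop is dropped as soon as any one filter can prove it unnecessary are assumptions made for illustration rather than a description of the actual logic.

```python
from typing import Callable, List

# A filter returns True when it can prove the snoop is unnecessary.
Filter = Callable[[int], bool]


class PortFilter:
    """Filters the snoop stream arriving from one coherence participant."""

    def __init__(self, filters: List[Filter]) -> None:
        self.filters = filters
        self.seen = 0
        self.forwarded = 0

    def must_forward(self, line: int) -> bool:
        self.seen += 1
        if any(f(line) for f in self.filters):
            return False  # at least one filter proved the snoop unnecessary
        self.forwarded += 1
        return True


class SnoopUnit:
    """Holds one independent port filter per remote participant."""

    def __init__(self, participants: List[str],
                 make_filters: Callable[[], List[Filter]]) -> None:
        self.ports = {p: PortFilter(make_filters()) for p in participants}

    def snoop(self, source: str, line: int) -> bool:
        return self.ports[source].must_forward(line)


def example_filters() -> List[Filter]:
    private = range(0x4000, 0x8000)  # hypothetical region never cached here
    return [lambda line: line in private]


unit = SnoopUnit(["core1", "core2", "core3"], example_filters)
print(unit.snoop("core1", 0x5000))  # False: filtered out
print(unit.snoop("core1", 0x9000))  # True: forwarded to the cache
```

Giving each participant its own port filter means the snoop streams can be examined independently, and the per-port counters make it easy to measure how many snoops each port actually forwards.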
Initial performance analysis of the design point was performed using trace-based simulation. As shown in Figure 3, this analysis demonstrated up to 98 percent efficiency for the popular SPLASH-2 benchmark suite. The result was a significant improvement in system performance, power, energy and energy-delay characteristics, as demonstrated by measurements of the UMT2K* application on BlueGene/P hardware after system completion (shown in Figure 4).
Figure 4: BlueGene/P hardware measurements: With/without snoop filtering
* UMT2K is a 3D, deterministic, multi-group photon transport program for unstructured meshes.
This article was written by Valentina Salapura on behalf of the BlueGene team.
Valentina Salapura. Scaling Up Next Generation Supercomputers. Keynote at ACM International Conference on Computing Frontiers 2008, Ischia, Italy, May 2008.
Valentina Salapura, Matthias Blumrich and Alan Gara. Design and Implementation of the Blue Gene/P Snoop Filter. HPCA'08: The 14th International Symposium on High-Performance Computer Architecture, February 2008.
Matthias Blumrich, Valentina Salapura and Alan Gara. Exploring the Architecture of a Stream Register-Based Snoop Filter, Transactions on HiPEAC, Volume 3, Issue 2, 2008.
Valentina Salapura, Matthias Blumrich and Alan Gara. Improving the Accuracy of Snoop Filtering Using Stream Registers, MEDEA Workshop in conjunction with PACT 2007 Conference, September 2007.
Valentina Salapura. Next generation supercomputers: Exploiting innovative massively parallel system architecture to facilitate breakthrough scientific discoveries. Invited talk at Grace Hopper Celebration of Women in Computing 2007 (GHC 2007), Orlando, Florida, October 2007.
Last updated on December 1, 2008