Computer Architecture: Publications

The BlueGene/L Team, IBM and Lawrence Livermore National Laboratory. An Overview of the BlueGene/L SuperComputer. Proceedings of Supercomputing 2002, November 2002.

Abstract:
This paper gives an overview of the BlueGene/L Supercomputer. This is a jointly funded research partnership between IBM and the Lawrence Livermore National Laboratory as part of the United States Department of Energy ASCI Advanced Architecture Research Program. Application performance and scaling studies have recently been initiated with partners at a number of academic and government institutions, including the San Diego Supercomputer Center and the California Institute of Technology. This massively parallel system of 65,536 nodes is based on a new architecture that exploits system-on-a-chip technology to deliver target peak processing power of 360 teraFLOPS (trillion floating-point operations per second). The machine is scheduled to be operational in the 2004--2005 time frame, at price/performance and power consumption/performance targets unobtainable with conventional architectures.


Allan M. Hartstein, and Thomas R. Puzak. The Optimum Pipeline Depth for a Microprocessor. Proceedings of 29th Annual International Symposium on Computer Architecture, Anchorage, Alaska.pp. 7-13,June 2002.

Abstract:
The impact of pipeline length on the performance of a microprocessor is explored both theoretically and by simulation. An analytical theory is presented that shows two opposing architectural parameters affect the optimal pipeline length: the degree of instruction level parallelism (superscalar) decreases the optimal pipeline length, while the lack of pipeline stalls increases the optimal pipeline length. This theory is tested by analyzing the optimal pipeline length for 35 applications representing three classes of workloads. Trace tapes are collected from SPEC95 and SPEC2000 applications, traditional (legacy) database and online transaction processing (OLTP) applications, and modern applications primarily written in Java and C++. The results show that there is a clear and significant difference in the optimal pipeline length between the SPEC workloads and both the legacy and modern applications. The SPEC applications, written in C, optimize to a shorter pipeline length than the legacy applications, largely written in assembler language, with relatively little overlap in the two distributions. Additionally, the optimal pipeline length distribution for the C++ and Java workloads overlaps with the legacy applications, suggesting similar workload characteristics. These results are explored across a wide range of superscalar processors, both in-order and out-of-order.


B. Abali, M. Banikazemi, X. W. Shen, H. Franke, D. E. Poff, and T. B. Smith. Hardware Compressed Main Memory: Operating System Support and Performance. IEEE Transactions on Computers, 50, 11, November 2001.

Abstract:
A new memory subsystem, called Memory Xpansion Technology (MXT), has been built for compressing main memory contents. MXT effectively doubles the physically available memory transparently to the CPUs, input/output devices, device drivers, and application software. An average compression ratio of two or greater has been observed for many applications. Since compressibility of memory contents varies dynamically, the size of the memory managed by the operating system is not fixed. In this paper, we describe operating system techniques that can deal with such dynamically changing memory sizes. We also demonstrate the performance impact of memory compression using the SPEC CPU2000 and SPECweb99 benchmarks. Results show that the hardware compression of memory has a negligible performance penalty compared to a standard memory for many applications. For memory starved applications and benchmarks such as SPECweb99, memory compression improves the performance significantly. Results also show that the memory contents of many applications can be compressed, usually by a factor of two to one.


K. Ebcioglu, E.R. Altman, M. Gschwind, S. Sathaye. Dynamic Binary Translation and Optimization. IEEE Transactions on Computers, Volume 50, Issue 6, pp. 529 - 548, June 2001.

Abstract:
We describe a VLIW architecture designed specifically as a target for dynamic compilation of an existing instruction set architecture. This design approach offers the simplicity and high performance of statically scheduled architectures, achieves compatibility with an established architecture, and makes use of dynamic adaptation. Thus, the original architecture is implemented using dynamic compilation, a process we refer to as DAISY (Dynamically Architected Instruction Set from Yorktown). The dynamic compiler exploits runtime profile information to optimize translations so as to extract instruction level parallelism. This paper reports different design trade-offs in the DAISY system and their impact on final system performance. The results show high degrees of instruction parallelism with reasonable translation overhead and memory usage.


Vijayalakshmi Srinivasan, David M. Brooks, Michael Gschwind, Pradip Bose, and Philip G. Emma. Optimizing Pipelines for Power and Performance. Proceedings of 35th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 35). IEEE Computer Society. November 2002, p. 333-344.

Abstract:
During the concept phase and definition of next generation high-end processors, power and performance will need to be weighted appropriately to deliver competitive cost/performance. It is not enough to adopt a CPI-centric view alone in early-stage definition studies. One of the fundamental issues confronting the architect at this stage is the choice of pipeline depth and target frequency. In this paper we present an optimization methodology that starts with an analytical power-performance model to derive optimal pipeline depth for a superscalar processor. The results are validated and further refined using detailed simulation based analysis. As part of the power-modeling methodology, we have developed equations that model the variation of energy as a function of pipeline depth. Our results using a set of SPEC2000 applications show that when both power and performance are considered for optimization, the optimal clock period is around 18 FO4. We also provide a detailed sensitivity analysis of the optimal pipeline depth against key assumptions of these energy models.