By combining application data-level parallelism with data-parallel SIMD execution, IBM Research catapulted the Power Architecture into the premier architecture for graphics and digital media processing, digital content creation, video game consoles and supercomputers.
By the end of the last century, microprocessor performance improvements were increasingly hard to obtain with traditional techniques such as higher clock frequencies and wider instruction issue. Design complexity got in the way of issuing more instructions per cycle, while higher clock frequencies resulted in deeper pipelines and made register pressure more problematic.
IBM researchers wanted to make the Power Architecture even more powerful with innovative single instruction multiple data (SIMD) processing. As the name suggests, SIMD processing increases the work a single instruction performs so that it operates on multiple data items simultaneously.
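To make the definition concrete, the following is a minimal sketch that counts "instructions" for an element-wise add done one element at a time versus four elements at a time. It is only a model: plain Python lists stand in for registers, and the lane count of four (four 32-bit values in a 16-byte register) and the function names are assumptions chosen for illustration.

```python
# Illustrative model only: Python lists stand in for SIMD registers.
LANES = 4  # assumed: four 32-bit elements per 16-byte register


def scalar_add(a, b):
    """Add element by element; one modeled instruction per element."""
    result = []
    instructions = 0
    for x, y in zip(a, b):
        result.append(x + y)
        instructions += 1
    return result, instructions


def simd_add(a, b):
    """Add LANES elements at a time; one modeled instruction per group."""
    result = []
    instructions = 0
    for i in range(0, len(a), LANES):
        group_a = a[i:i + LANES]
        group_b = b[i:i + LANES]
        result.extend(x + y for x, y in zip(group_a, group_b))
        instructions += 1
    return result, instructions
```

For eight elements, the scalar version issues eight modeled instructions and the SIMD version issues two, while both produce the same result.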
We wanted to extend SIMD's use beyond graphics processing to real-time content creation and high-performance computing applications. Based on these enhancements, Power Architecture with SIMD execution became the processing power of choice for Sony, Microsoft and Nintendo when they needed reliable, high-performance microprocessors for their game consoles. As a result, Power Architecture microprocessors power the next-generation game consoles of all three major game console producers.
The latency proves challenging
Instruction latency, the time it takes to execute a single instruction from start to finish, has increased drastically over several generations. Longer latencies mean more data must be held in registers to keep execution from stalling, thereby creating demand for more registers. The number of registers increased over time from eight or 16 during the CISC era to 32 registers for RISC* architectures.
A simple test program (Figure 1) reveals that the ratio between “register pressure” and number of registers has gotten worse over several generations. The chart shows the number of registers needed to keep the fused multiply-add (FMA) operation from stalling because of register starvation, which occurs as a result of compute latency. Simple programs began to experience these limitations, and real programs slowed down significantly.
To solve this problem, the traditional solution was to use rename hardware, a complex, expensive and power-hungry technology. We opted instead to increase the number of architected registers significantly to 128.
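The link between latency and register demand can be approximated with a back-of-the-envelope model. The sketch below is not the test program behind Figure 1; it simply assumes that every operation still in flight holds a fixed number of live values in registers. The function names and the example latency, issue rate and live-value counts are hypothetical and are not measurements of any particular processor.

```python
def registers_to_avoid_stalls(latency_cycles, ops_per_cycle=1, live_values_per_op=1):
    """Estimate registers needed to keep a pipelined unit busy.

    Assumption: latency_cycles * ops_per_cycle independent operations are
    in flight at once, and each holds live_values_per_op register values.
    """
    if latency_cycles < 1 or ops_per_cycle < 1 or live_values_per_op < 1:
        raise ValueError("all parameters must be positive")
    in_flight = latency_cycles * ops_per_cycle
    return in_flight * live_values_per_op


def fits(needed, available):
    """True if the estimated demand fits in the architected register file."""
    return needed <= available


# Hypothetical example: 7-cycle FMA, 2 issued per cycle, 3 live values each.
demand = registers_to_avoid_stalls(7, ops_per_cycle=2, live_values_per_op=3)
print(demand, fits(demand, 32), fits(demand, 128))
```

Under these assumed numbers the estimated demand exceeds a 32-register file but fits comfortably within 128 architected registers, which is the kind of gap the paragraph above describes.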
Source: Gschwind et al., Hot Chips 17, 2005
Synergistic processing in Cell's multicore architecture
When we began developing the most power- and area-efficient core in the history of microprocessors for the Cell BE, with the goal of building a supercomputer on a chip that would offer a tenfold performance increase over any previous design, we naturally decided to exploit SIMD processing in the Cell Synergistic Processor Element (SPE).
Defining a new instruction set from the ground up gave us an opportunity to accomplish many goals beyond what we could achieve merely by extending an existing architecture. We used the instruction set to create a simpler design with SIMD RISC, and to integrate scalar and SIMD execution with pervasively data-parallel computing (PDPC) that uses wide 16B data everywhere to increase processing efficiency.
Figure 2 shows four vector execution units that operate on 16B SIMD data for fixed point, floating point, data formatting and load/store instructions. Following the PDPC concept, no separate execution units are provided for scalar instructions. Instead, the compiler must exploit the parallel hardware and direct programs to use the SIMD hardware even for scalar execution ("scalar layering").
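Scalar layering can be sketched as follows. This is a simplified model rather than actual SPE code: a 16B register is represented as a list of four 32-bit words, a scalar is assumed to live in one designated slot of that register, and the helper names are hypothetical. The idea is that a scalar add is carried out by loading the aligned 16B blocks that contain the operands, rotating each wanted word into the designated slot, and then running an ordinary full-width vector add.

```python
WORDS_PER_REG = 4   # a 16-byte register viewed as four 32-bit words
SCALAR_SLOT = 0     # assumed slot in which scalar values are kept


def load_quadword(memory, word_index):
    """Load the aligned 4-word block that contains memory[word_index]."""
    base = (word_index // WORDS_PER_REG) * WORDS_PER_REG
    return memory[base:base + WORDS_PER_REG]


def rotate_to_scalar_slot(reg, word_index):
    """Rotate the register so the wanted word lands in SCALAR_SLOT."""
    shift = (word_index % WORDS_PER_REG - SCALAR_SLOT) % WORDS_PER_REG
    return reg[shift:] + reg[:shift]


def vector_add(ra, rb):
    """Full-width add: every lane is computed, whether it is needed or not."""
    return [(x + y) & 0xFFFFFFFF for x, y in zip(ra, rb)]


def scalar_add_layered(memory, i, j):
    """Compute memory[i] + memory[j] using only vector-style operations."""
    ra = rotate_to_scalar_slot(load_quadword(memory, i), i)
    rb = rotate_to_scalar_slot(load_quadword(memory, j), j)
    return vector_add(ra, rb)[SCALAR_SLOT]
```

With `memory = [10, 20, 30, 40, 50, 60, 70, 80]`, calling `scalar_add_layered(memory, 2, 5)` adds the words 30 and 60. The other three lanes are computed as well and simply ignored, which is the trade-off of not providing separate scalar execution units.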
Source: Gschwind et al., Hot Chips 17, 2005
Power Architecture acceleration with next-generation SIMD
While the increased number of registers was easy to accommodate in a new instruction set such as that of the Cell SPE, it was more challenging to add more registers and new functions to the pre-existing Power Architecture instruction set while complying with the instruction formats of the established VMX multimedia instructions. At the same time, specifying an efficient compilation target excluded the use of special formats, such as paired registers, register subsetting or register windowing.
As we looked into next-generation SIMD processing, we found a new encoding that facilitated efficient implementation using non-contiguous register specifiers (Figure 3). To address 128 registers, the number of bits for operands located in the registers (VT, VA and VB) had to increase from five to seven. We did not move the original five bits. The additional two bits to address registers exploited unused bits in the instruction (labeled VTX, AX and VBX).
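A non-contiguous register specifier of this kind can be sketched as below. The bit positions are hypothetical, since the text does not give the actual layout shown in Figure 3, and the sketch also assumes that the two extension bits supply the high-order bits of the register number. The function names are illustrative only.

```python
# Hypothetical field positions within a 32-bit instruction word.
LOW_SHIFT = 21   # assumed position of the original 5-bit specifier
EXT_SHIFT = 2    # assumed position of the 2 extension bits
LOW_MASK = 0x1F  # five bits
EXT_MASK = 0x3   # two bits


def insert_register(word, reg):
    """Write a 7-bit register number into the split specifier fields."""
    if not 0 <= reg < 128:
        raise ValueError("register number must be in 0..127")
    clear = ~((LOW_MASK << LOW_SHIFT) | (EXT_MASK << EXT_SHIFT)) & 0xFFFFFFFF
    word &= clear
    word |= (reg & LOW_MASK) << LOW_SHIFT          # original five bits stay put
    word |= ((reg >> 5) & EXT_MASK) << EXT_SHIFT   # two extra bits go elsewhere
    return word


def extract_register(word):
    """Reassemble the 7-bit register number from the two fields."""
    low5 = (word >> LOW_SHIFT) & LOW_MASK
    ext2 = (word >> EXT_SHIFT) & EXT_MASK
    return (ext2 << 5) | low5
```

In this sketch, register 100 is stored as the value 4 in the five-bit field and 3 in the two-bit extension field, and `extract_register` recombines them. Because the five-bit field does not move, a decoder that reads only that field still sees it in its established location.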
Source: Gschwind et al., US2005000707573P, 2005
Microsoft was the first proving ground for the VMX2 concept. Microsoft needed high-performance chip multi-processors for the Xbox game console. Ultimately, the client found that its tremendous processing needs could be met only with a Power Architecture-based chip multi-processor built on the innovative VMX2 project.
Using results from our VMX2 research, we put together a proposal called “VMX128” in which we recommended pushing an existing RISC architecture to an incredible 128 registers. Our proposal helped IBM win the bid for the design of the Microsoft Xbox 360.
Data-parallel SIMD vector processing continues to be an important research area. Future technologies will benefit from our research efforts to increase microprocessor performance for future IBM server and BlueGene systems, as well as for consumer electronics applications.
Acknowledgements
In addition to our customers and Power.org partners, the IBM Research data parallel SIMD processing effort has benefited from the contributions of the IBM PowerPC Architecture Control Board; IBM design groups in Austin, Raleigh and Rochester; the IBM compiler groups at the Thomas J. Watson Research Center, Toronto and Haifa Research Labs; and application groups in the Watson Research Center, Haifa Research Labs and the S/T/I design center that developed the Cell Broadband Engine in Austin.
* The RISC concept was pioneered in the course of the 801 project at IBM T.J. Watson Research Center, and first commercialized with the RISC System/6000® family based on the new IBM Power (Performance Optimization With Enhanced RISC) Architecture. The hardware implementation takes advantage of this powerful "reduced instruction set computer" architecture and employs sophisticated design techniques to achieve a short cycle time and a low cycles-per-instruction (CPI) ratio.
Related Publications
Michael Gschwind. The Cell Broadband Engine: Exploiting multiple levels of parallelism in a chip multiprocessor. International Journal of Parallel Programming 35(3):233-262, June 2007.
Michael Gschwind, H. Peter Hofstee, Brian Flachs, Martin Hopkins, Yukio Watanabe and Takeshi Yamazaki. Synergistic processing in Cell's multicore architecture. IEEE Micro 26(2):10-24, March 2006.
Michael Gschwind, David Erb, Sid Manning and Mark Nutter. An Open Source Environment for Cell Broadband Engine System Software. IEEE Computer 40(6):37-47, June 2007.
Jeff Andrews and Nick Baker. Xbox 360 System Architecture. IEEE Micro, March 2006.
Keith Diefendorff, Pradeep K. Dubey, Ron Hochsprung and Hunter Scales. AltiVec Extension to PowerPC Accelerates Media Processing. IEEE Micro 20(2):85-95, March 2000.
Presentations
Cell Broadband Engine: Enabling density computing for data-rich environments, International Symposium on Computer Architecture tutorial, 2006.
Chip multiprocessing and the Cell Broadband Engine, ACM Computing Frontiers, keynote speech, 2006.
A novel SIMD architecture for the Cell heterogeneous chip multiprocessor, Hot Chips 17, 2005.
Altivec: A Second Generation SIMD Microprocessor Architecture, Hot Chips 10, 1998.
Patents
SIMD datapath coupled to scalar/vector/address/conditional data register file with selective subpath scalar processing mode.
M. Gschwind, P. Hofstee, M. Hopkins.
2005/01/04. Issued as US patent 6839828.
Additional patents pending.
Awards
EE Times ACE Design Team of the Year 2006 (Xbox360 Team).
Technology of the Year (Cell), IEEE Spectrum 2006.
Microprocessor Report Analysts' Choice Award 2005 for Best High-Performance Embedded Processor: Cell Broadband Engine.
Microprocessor Report Best Technology Award 2004 for "STI Design Center's CELL Processor."
EE Times "Best In Class" Design Team for STI Cell development team 2004.
Microprocessor Report Analysts’ Choice Award for Best Desktop Processor of 2003 for IBM’s PowerPC 970FX.