IBM Research welcomes members of the research community to our seminars. To ensure compliance with IBM security guidelines, we ask that you contact the seminar host in advance. When you arrive at the Research lab, please provide the host's name to the receptionist.
| Fast, Accurate Simulators and Auto-Generated Implementations | ||||
| Derek Chiou | On: | 20-Nov-2009 10:00 AM - 11:30 AM | ||
| Assistant professor | At: | Watson Research Center (Yorktown), Room 20-001 | ||
| University of Texas at Austin | ||||
Abstract: In this talk, I will describe (i) the FPGA-Accelerated Simulation Technologies (FAST) methodology for building fast, scalable, full-system, cycle-accurate simulators of multicore target systems and (ii) how a description of such simulators can be automatically transformed into a synthesizable RTL implementation that contains the target micro-architecture. We are working on a version of the FAST simulator that runs on a multicore+FPGA host platform and simulates a multicore x86 target running unmodified Linux and applications while predicting performance at a cycle-accurate level. Simulation speeds are expected to be an aggregate 5MIPS-10MIPS per host core in cycle-accurate mode and significantly faster at lower target accuracies. The simulator is currently being augmented with power estimation and reliability modeling capabilities at the same simulation speeds. Speaker biography: Derek Chiou is an assistant professor at the University of Texas at Austin. Before UT, Dr. Chiou was a system architect for five years at Avici Systems, a manufacturer of terabit core routers. Dr. Chiou received his Ph.D., S.M. and S.B. degrees in Electrical Engineering and Computer Science from MIT. | ||||
| Intel Core i7 and Intel Xeon 5500 Microarchitecture, Optimization and Performance Analysis | ||||
| David Levinthal | On: | 12-Nov-2009 09:00 AM - 05:45 PM | ||
| Intel | At: | Watson Research Center (Yorktown), Room 20-043 | ||
Abstract: This tutorial covers architectural design, performance analysis and software optimization of Core i7 and Xeon 5500 microarchitectures, Intel's new microprocessor family. After presenting architectural details of the new microprocessor family, the tutorial presents a number of optimizations, developed by the compiler team, that exploit the new microarchitectural features. The tutorial discusses how a performance analyst can use the PMU (performance monitoring unit) events to understand application performance. The tutorial also discusses NUMA optimization issues and how to identify optimization opportunities with the PMU events. The tutorial concludes with a demonstration of the PTU tool on pre-collected Intel Xeon 5500 performance data to help tie all of the material presented in the tutorial together. Speaker biography: Dr. Levinthal's principal responsibilities are performance analysis and software optimization. These responsibilities include the specification and validation of PMU events, the specification and validation of performance tools, and last level processor enabling support. Prior to joining Intel, Dr. Levinthal was a tenured Professor at Florida State University for ten years, specializing in experimental particle physics and working at Fermi National Laboratory and CERN. He has a Ph.D. and two master's degrees from Columbia University and a B.A. from University of California at Berkeley. His academic honors include a Sloan Fellowship, an NSF PYI award and the DOE OJI award. | ||||
| The GPU is dead, long live the GPU | ||||
| Karu Sankaralingam | On: | 29-Oct-2009 10:00 AM - 11:30 AM | ||
| Assistant professor | At: | Watson Research Center (Yorktown), Room 20-001 | ||
| The University of Wisconsin-Madison | ||||
Abstract: There is tremendous interest in both academia and industry on GPUs and how they can evolve into high throughput engines. A plethora of tools, libraries, and frameworks have been proposed for programming this rigid and restricted GPU architecture. In this talk, I will first show why this approach is flawed. I will show that conventional GPUs are becoming unsuitable for future graphics tasks in the first place. I characterize modern GPUs and show that this system revolves around the Z-buffer algorithm and how this is posing problems in providing significant visual quality improvements. I then show that an approach centered around applications and sophisticated visibility algorithms with ray-tracing is possible. The architecture and application challenges can be addressed through a full system co-design approach. I present an entire graphics system called Copernicus that demonstrates algorithms, a software architecture, and a hardware architecture for real-time rendering using ray-tracing. In this talk, I describe different parts of this ongoing project focusing on the architecture and implications for GPUs and future CPUs. I will conclude with some other architecture directions my group is pursuing driven by this work including our work on hardware reliability and energy efficiency. Speaker biography: Karu Sankaralingam (http://www.cs.wisc.edu/~karu) is an assistant professor in the computer sciences department at the University of Wisconsin-Madison, where he also leads the Vertical Research Group (http://www.cs.wisc.edu/vertical/). His research interests include microarchitecture, architecture, and software issues for massively parallel computation systems. He is a recipient of the NSF CAREER award. He earned a PhD from The University of Texas at Austin in December 2006, and was the lead student architect of the TRIPS chip, a 170 million transistor chip. | ||||
| Amdahl's Law in the Multicore Era | ||||
| Mark D. Hill | On: | 22-Jul-2009 10:00 AM - 11:30 AM | ||
| University of Wisconsin-Madison | At: | Watson Research Center (Yorktown), Room 20-001 | ||
Abstract: Over the last several decades computer architects have been phenomenally successful turning the transistor bounty provided by Moore's Law into chips with ever increasing single-threaded performance. During many of these successful years, however, many researchers paid scant attention to multiprocessor work. Now as vendors turn to multicore chips, researchers are reacting with more papers on multi-threaded systems. While this is good, we are concerned that further work on single-thread performance will be squashed. To help understand future high-level trade-offs, we develop a corollary to Amdahl's Law for multicore chips [Hill & Marty, IEEE Computer 2008]. It models fixed chip resources for alternative designs that use symmetric cores, asymmetric cores, or dynamic techniques that allow cores to work together on sequential execution. Our results encourage multicore designers to view performance of the entire chip rather than focus on core efficiencies. Moreover, we observe that obtaining optimal multicore performance requires further research BOTH in extracting more parallelism and making sequential cores faster. Speaker biography: Mark D. Hill (http://www.cs.wisc.edu/~markhill) is a professor in both the Computer Sciences Department and the Electrical and Computer Engineering Department at the University of Wisconsin-Madison, where he also co-leads the Wisconsin Multifacet project with David Wood. He earned a Ph.D. from the University of California, Berkeley. He is an ACM Fellow and a Fellow of the IEEE. His past work ranges from refining multiprocessor memory consistency models to developing the 3C model of cache behavior (compulsory, capacity, and conflict misses). | ||||
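The multicore corollary to Amdahl's Law mentioned in the abstract can be sketched numerically. The code below is a minimal illustration, not the speakers' own code: it assumes a chip budget of n base-core equivalents (BCEs), a parallel fraction f, and a large core built from r BCEs whose sequential performance is sqrt(r). The square-root performance model and all function names are assumptions made for this sketch.

```python
import math


def perf(r: float) -> float:
    """Assumed sequential performance of one core built from r BCEs."""
    return math.sqrt(r)


def speedup_symmetric(f: float, n: int, r: int) -> float:
    """n/r identical cores, each built from r BCEs."""
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))


def speedup_asymmetric(f: float, n: int, r: int) -> float:
    """One large r-BCE core plus (n - r) single-BCE cores."""
    return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))


def speedup_dynamic(f: float, n: int, r: int) -> float:
    """Cores fuse into one r-BCE core for serial code, split into n for parallel code."""
    return 1.0 / ((1 - f) / perf(r) + f / n)


if __name__ == "__main__":
    f, n, r = 0.9, 16, 4
    for name, fn in [("symmetric", speedup_symmetric),
                     ("asymmetric", speedup_asymmetric),
                     ("dynamic", speedup_dynamic)]:
        print(f"{name:>10}: {fn(f, n, r):.2f}")
```

With r = 1 the symmetric form reduces to the classic Amdahl's Law, which gives a quick sanity check on the formulas.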
| Coordinated Power and Performance Control in Virtualized Data Centers | ||||
| Xiaorui Wang | On: | 24-Jun-2009 03:00 PM - 04:00 PM | ||
| University of Tennessee | At: | Austin Research Lab, Room YKT 20-001 (Seminar simulcast from Austin) | ||
Abstract: In recent years, power control has become one of the most serious concerns for large-scale data centers that are rapidly expanding the number of hosted servers. In addition to reducing operating costs, precisely controlling power consumption and heat dissipation is an essential way to avoid system failures caused by power capacity overload or overheating due to increasingly high server density (e.g., blade servers). Power control becomes even more challenging as many data centers start to adopt virtualization technology for resource sharing, leading to increased utilization and power consumption for each server. In this talk, we will present a coordinated power and performance control framework designed for today’s virtualized data centers. Our framework first provides highly scalable power control solutions in a hierarchical way at three different levels: single servers, server racks, and entire data centers, as there exist physical and contractual power limits at each level. Our framework also includes novel performance control algorithms that can provide power-efficient application-level performance guarantees for multiple virtual machines running on the same physical server, even though their performance is correlated and can be impacted differently by the power state transitions of the shared hardware resources. Furthermore, our control framework coordinates various power and performance control schemes at different system layers to achieve simultaneous guarantees on both power and performance for virtualized data centers. Unlike related work, our control solutions feature a rigorous system design methodology based on recent advances in feedback control theory for analytical assurance of control accuracy and system stability. We will also introduce our work on power management for chip multiprocessors and on-chip L2 caches. 
Speaker biography: Xiaorui Wang is an Assistant Professor in the Department of Electrical Engineering and Computer Science at the University of Tennessee, Knoxville. He received his Ph.D. in Computer Science from Washington University in St. Louis in 2006. | ||||
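To make the idea of feedback-based power control concrete, the following is a minimal sketch of a single-server control loop, not the framework described in the talk. It assumes a hypothetical linear power model (40 W idle plus 30 W per GHz), a hypothetical gain, and a frequency range of 1.0 to 3.0 GHz; every name and constant here is an illustrative assumption.

```python
def control_step(freq, measured_power, budget, gain, f_min, f_max):
    """Nudge the CPU frequency in proportion to the power error, then clamp it."""
    error = budget - measured_power          # positive means headroom is available
    return min(f_max, max(f_min, freq + gain * error))


def plant_power(freq):
    """Hypothetical server power model in watts (an assumption, not measured data)."""
    return 40.0 + 30.0 * freq


def simulate(budget, steps=50, gain=0.02):
    """Run the loop against the hypothetical plant and return (frequency, power)."""
    freq = 3.0                               # start at the maximum frequency
    for _ in range(steps):
        freq = control_step(freq, plant_power(freq), budget, gain, 1.0, 3.0)
    return freq, plant_power(freq)


if __name__ == "__main__":
    freq, power = simulate(budget=100.0)
    print(f"settled at {freq:.3f} GHz drawing {power:.2f} W")
```

With this gain the error shrinks by a constant factor each step, which is the kind of stability property that a control-theoretic design aims to guarantee analytically.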
| Enabling Fair, High-Performance, and Scalable Memory Systems | ||||
| Onur Mutlu | On: | 19-Jun-2009 10:30 AM - 12:00 PM | ||
| Assistant Professor of ECE/CS | At: | Austin Research Lab, Room Yorktown 20-001 (Seminar simulcast from Austin) | ||
| Carnegie Mellon University | ||||
Abstract: In a chip-multiprocessor system, the DRAM system is shared among cores. In a shared DRAM system, requests from a thread can not only delay requests from other threads by causing bank/bus/row-buffer conflicts but they can also destroy other threads’ DRAM-bank-level parallelism. Requests whose latencies would otherwise have been overlapped could effectively become serialized. As a result both fairness and system throughput degrade, and some threads can starve for long time periods. I will describe a fundamentally new approach to designing a shared DRAM controller that provides quality of service to threads, while also improving system throughput. Our parallelism-aware batch scheduler (PAR-BS) design is based on two key ideas. First, PAR-BS processes DRAM requests in batches to provide fairness and to avoid starvation of requests. Second, to optimize system throughput, PAR-BS employs a parallelism-aware DRAM scheduling policy that aims to process requests from a thread in parallel in the DRAM banks, thereby reducing the memory-related stall-time experienced by the thread. PAR-BS seamlessly incorporates support for system-level thread priorities and can provide different service levels, including purely opportunistic service, to threads with different priorities. Our evaluations show that PAR-BS provides better system performance and fairness compared to four previously proposed memory scheduling techniques, including Stall-Time Fair Memory scheduling. In the last portion of the talk, I will briefly describe some of the future directions in my group related to designing scalable and software-controllable many-core systems. Some of these projects include Bufferless On-Chip Networks, QoS- and Application-Aware On-Chip Networks, and Hardware/Software Cooperative QoS in Memory Systems and On-Chip Networks. Speaker biography: Onur Mutlu is an Assistant Professor of ECE/CS at Carnegie Mellon University. 
He has a PhD and an MS in ECE from the University of Texas at Austin and BS degrees in Computer Engineering and Psychology from the University of Michigan, Ann Arbor. | ||||
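The two ideas behind PAR-BS named in the abstract, batching and parallelism-aware ordering, can be illustrated with a simplified sketch. This is not the authors' scheduler: the per-thread, per-bank marking cap, the thread-ranking rule (lightest maximum bank load first), and the single global service order are simplifying assumptions, and all names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    thread: int
    bank: int
    arrival: int  # lower value means older request; assumed unique


def form_batch(queue, cap):
    """Mark up to `cap` of the oldest requests per (thread, bank) pair."""
    taken = defaultdict(int)
    batch = []
    for req in sorted(queue, key=lambda r: r.arrival):
        key = (req.thread, req.bank)
        if taken[key] < cap:
            taken[key] += 1
            batch.append(req)
    return batch


def rank_threads(batch):
    """Rank threads so that those with the lightest per-bank load go first."""
    load = defaultdict(lambda: defaultdict(int))
    for req in batch:
        load[req.thread][req.bank] += 1
    return sorted(load, key=lambda t: (max(load[t].values()),
                                       sum(load[t].values()), t))


def schedule(queue, cap):
    """Serve batched requests first, grouped by thread rank, then by age."""
    batch = form_batch(queue, cap)
    marked = set(batch)
    rank = {t: i for i, t in enumerate(rank_threads(batch))}
    return sorted(queue, key=lambda r: (r not in marked,
                                        rank.get(r.thread, len(rank)),
                                        r.arrival))
```

Batching bounds how long any request can wait, while serving one thread's requests together lets its accesses to different banks overlap instead of being interleaved with other threads' requests.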
| Low-Power, High-Performance Analog Neural Branch Prediction | ||||
| Daniel Jiménez | On: | 28-May-2009 11:00 AM - 12:30 PM | ||
| Associate professor | At: | Watson Research Center (Yorktown), Room YKT 32-036 (Seminar simulcast from Austin) | ||
| University of Texas (San Antonio) | ||||
Abstract: Shrinking transistor sizes and a trend toward low-power processors have caused increased leakage, high per-device variation and a larger number of hard and soft errors. Maintaining precise digital behavior on these devices grows more expensive with each technology generation. In some cases, replacing digital units with analog equivalents allows similar computation to be performed at higher speed and lower power. The units that can most easily benefit from this approach are those whose results do not have to be precise, such as various types of predictors. We introduce the Scaled Neural Predictor, a highly accurate prediction algorithm that is infeasible in a purely digital implementation but can be implemented using analog circuitry. Our predictor uses current summation to replace the expensive digital dot-product computation required in perceptron predictors. We show that the analog predictor can outperform digital neural predictors because of the reduced cost, in power and latency, of the key computations. The analog neural predictor circuit is able to produce an accuracy equivalent to an infeasible digital neural predictor that requires 128 additions per prediction. The analog version, however, can run in 200 picoseconds, with the analog portion of the prediction computation requiring less than 0.4 milliwatts at a 45 nm technology, which is negligible compared to the power required for the table lookups in this and conventional predictors. Speaker biography: Daniel Jiménez is an associate professor with tenure in the Department of Computer Science at the University of Texas at San Antonio. Daniel's current research interests include microarchitecture and low-level compiler optimizations. | ||||
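For readers unfamiliar with perceptron predictors, the sketch below shows the digital dot-product computation that the abstract says is replaced by analog current summation. It is a textbook-style perceptron predictor, not the Scaled Neural Predictor itself; the table-free single-perceptron setup, the threshold value, and the function names are assumptions made for illustration.

```python
def predict(weights, history):
    """Return (output, taken?) for one branch.

    weights[0] is the bias weight; history holds +1 (taken) or -1 (not taken)
    for each of the most recent branch outcomes.
    """
    y = weights[0] + sum(w * h for w, h in zip(weights[1:], history))
    return y, y >= 0


def train(weights, history, taken, y, threshold):
    """Update weights on a misprediction or when the output was not confident."""
    t = 1 if taken else -1
    if (y >= 0) != taken or abs(y) <= threshold:
        weights[0] += t
        for i, h in enumerate(history):
            weights[i + 1] += t * h


if __name__ == "__main__":
    weights = [0, 0, 0, 0]
    history = [1, -1, 1]
    y, guess = predict(weights, history)
    train(weights, history, taken=False, y=y, threshold=1)
    print(weights, predict(weights, history))
```

The sum inside `predict` is the step whose cost grows with history length in a digital design, which is why it is the natural target for an analog implementation.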
| The End of Denial Architecture and the Rise of Throughput Computing | ||||
| Bill Dally | On: | 20-May-2009 10:00 PM | ||
| CTO and VP of Research | At: | Watson Research Center (Yorktown), Room 26-004 | ||
| NVIDIA Corporation | ||||
Abstract: Most modern processors are in denial about two critical aspects of machine organization. They hide from the programmer and compiler the underlying parallel execution and hierarchical memory organization, presenting an illusion of sequential execution and uniform, flat memory. The evolution of these sequential, latency-optimized processors is at an end, and their performance is increasing only slowly over time. In contrast, the performance of throughput-optimized processors, like GPUs, continues to scale at historical rates. Throughput processors have hundreds of cores today and will have thousands of cores by 2015. They will deliver most of the performance, and most of the user value, in future computer systems. Throughput processors embrace, rather than deny, parallelism and memory hierarchy to realize large performance and efficiency advantages. Parallelism can take advantage of the plentiful and inexpensive arithmetic units in a throughput processor. Without locality, however, bandwidth quickly becomes a bottleneck. Communication bandwidth, not arithmetic, is the critical resource in a modern computing system that dominates cost, performance, and power. Stream programming simplifies the exploitation of both parallelism and locality. A stream program naturally exposes parallelism across stream elements and kernels. Locality is also exposed, both within and between kernels. This talk will discuss exploitation of parallelism and locality using stream processing with examples drawn from the Imagine and Merrimac projects, from NVIDIA GPUs, and from three generations of stream programming systems. | ||||
| Exploring Strategies to Mitigate the Off-Chip Bandwidth Scarcity on Multicore Architectures | ||||
| Yan Solihin | On: | 8-May-2009 10:00 AM - 11:30 AM | ||
| Associate professor, Department of Electrical and Computer Engineering | At: | Watson Research Center (Yorktown), Room 20-001 | ||
| North Carolina State University | ||||
Abstract: In this talk, I will briefly show the looming problem of the bandwidth wall, a situation that arises when the growth in the number of cores greatly outpaces the growth of off-chip pin bandwidth. This introduces the context of the two ongoing projects that I will discuss. The first project deals with how off-chip bandwidth partitioning is related to last level cache partitioning. We derive a simple analytical model which tries to answer the following questions: (1) Does bandwidth partitioning only have a secondary impact on performance compared to cache partitioning? (2) If we have an ideal cache partitioning scheme, do we still need memory bandwidth partitioning? (3) What is the optimal memory bandwidth partitioning decision for the best system performance? The model gives us an ability to answer the questions, and I will present interesting insights that we obtain from the model. The second project deals with the pipeline and cache inefficiencies of current bulk memory operations. Bulk (large-region) memory copying and initialization are among the most ubiquitous operations performed in current computer systems. While many current systems rely on a loop of loads and stores, there are proposals to introduce a single instruction to perform large-region memory copying. While such an instruction can improve performance by generating fewer TLB and cache accesses and requiring fewer pipeline resources, we will show that the key to significantly improving the performance of such instructions is removing pipeline and cache bottlenecks of the code that follows the instructions. We show how we can address such inefficiencies with a special engine to perform bulk memory copying. 
When applied to OS kernel buffer management, we show that on average the proposed architecture support achieves speedups of between 23% and 32%, which is roughly 3-4X higher than an alternative scheme, and 1.5-2X better than a highly optimistic DMA engine with zero setup and interrupt overheads. Speaker biography: Yan Solihin currently serves as an associate professor at the Department of Electrical and Computer Engineering at North Carolina State University. He obtained his PhD degree from the University of Illinois at Urbana-Champaign in 2002. | ||||
| The Haves and Have-nots in Many-Core Parallel Computing | ||||
| Wen-mei Hwu | On: | 30-Apr-2009 10:00 AM - 11:30 AM | ||
| Professor | At: | Watson Research Center (Yorktown), Room 20-001 | ||
| University of Illinois at Urbana-Champaign | ||||
Abstract: Modern GPUs such as the NVIDIA GeForce GTX280 and AMD/ATI Radeon 4860 are massively parallel, many-core processors. Today, developers are reporting orders of magnitude variation in application performance on these processors. It is no surprise that the single most important factor of the achievable speedup for an application is its memory bandwidth requirement, which can be roughly defined as the number of arithmetic operations performed for each memory byte accessed from off-chip memory. In this talk, I will discuss the reasons why the memory bandwidth requirement will likely further enlarge the performance gap between applications. I will then present our experience in tackling the memory bandwidth requirement in several classes of applications as well as our current work on problem solving strategies and programming tools that will likely help application developers to drastically reduce the memory bandwidth requirement of their applications. Speaker biography: Wen-mei W. Hwu is a Professor and holds the Sanders-AMD Endowed Chair in the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign. His research interests are in the area of architecture, implementation, and software for high performance computer systems. He is the director of the IMPACT research group (www.crhc.uiuc.edu/Impact). | ||||
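The link between operations per off-chip byte and achievable performance can be sketched with a simple bound. This is an illustrative roofline-style estimate, not a model presented in the talk; the function name and the sample peak and bandwidth figures are hypothetical and do not describe any specific GPU.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, ops_per_byte):
    """Upper bound on throughput: limited by compute or by off-chip bandwidth."""
    return min(peak_gflops, bandwidth_gb_s * ops_per_byte)


if __name__ == "__main__":
    peak, bandwidth = 1000.0, 100.0   # hypothetical device: 1000 GFLOP/s, 100 GB/s
    for intensity in (0.5, 2.0, 10.0, 20.0):
        bound = attainable_gflops(peak, bandwidth, intensity)
        print(f"{intensity:5.1f} ops/byte -> at most {bound:7.1f} GFLOP/s")
```

Under these assumed figures, an application needs at least 10 operations per byte before the arithmetic units, rather than memory bandwidth, become the limit, which is one way to see why applications differ so widely in achieved speedup.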
| Power to the People (PTP): Learning and Leveraging the Relationship between Architectural Properties and User Satisfaction | ||||
| Gokhan Memik | On: | 17-Apr-2009 10:00 AM - 11:30 AM | ||
| Asst Professor | At: | Watson Research Center (Yorktown), Room 20-001 | ||
| Northwestern University | ||||
Abstract: The ultimate goal of any product is to satisfy its users. In this talk, I will describe our work in trying to understand the relation between computer hardware performance and user satisfaction. Specifically, we analyze the relationship between hardware performance counter (HPC) readings and individual satisfaction levels reported by users for several interactive applications. Our results show that the satisfaction of the user is strongly correlated to the performance of the underlying hardware and, more importantly, that user satisfaction is highly user-dependent. Our first set of experiments shows that there is large variation in expected performance among individual users. This variation can be exploited by clients/servers to optimize their services according to the needs of the individual users. To achieve this, computers need to collect information about user satisfaction. The second part of my talk focuses on our efforts to achieve this goal. Specifically, I will describe our work on the development of new biometric input devices for providing the computer information about the user’s physiological traits. The goal in this work is to understand the users’ involvement/satisfaction by monitoring their physiological traits and making architectural decisions accordingly. We explore three biometric devices as potential sensors: an eye tracker, a galvanic skin response (GSR) sensor, and force sensors. Our initial experiments show that there are significant changes in human physiological traits as processor performance decreases. We also observe that the physiological changes correlate strongly to the satisfaction levels reported by the users. The implications of these results are numerous. 
First, these results illustrate that the physiological traits can be effectively used to control the processor: for example, we show that the total system power consumption of a laptop can be reduced by up to 18.4% on average using a physiological-traits based power management scheme. More importantly, they indicate that the user satisfaction can be extracted without causing any dissatisfaction to the user. As a result, many architectural decisions (e.g., parallelism of an application, reliability and power related performance fluctuations, etc.) can be made while considering the individual user satisfaction. I will conclude the talk by highlighting other ongoing/recent projects. We have developed variable access latency caches that can be used to increase the performance under large variations in circuit properties (i.e., process and thermal variations). I will then describe how architectures can be designed to minimize the yield losses and highlight possible effects of such architectural schemes on the profits obtained from a set of chips. Then, I will present the clumsy processors and discuss our work in thermal-aware architectures. Speaker biography: Prof. Gokhan Memik is the Wissner-Slivka Junior Chair Assistant Professor at the Electrical Engineering and Computer Science Department of Northwestern University. He received a B.S. in Computer Engineering in 1998 from Bogazici University (Istanbul, Turkey) and a PhD in Electrical Engineering from University of California at Los Angeles (UCLA) in 2003 under the supervision of William H. Mangione-Smith. | ||||
| On Building and Securing Processors | ||||
| Simha Sethumadhavan | On: | 6-Mar-2009 07:07 PM | ||
| Assistant Professor | At: | Watson Research Center (Yorktown), Room 20-001 | ||
| Columbia University | ||||
Abstract: This talk has two themes. One is on techniques for building highly concurrent processors. The other is on mechanisms for improving security in concurrent processors. The primary memory system of a processor poses a problem for a tile-based design methodology because some of the functionality provided by the memory system, such as maintaining sequential memory semantics, is not easily partitioned. I will describe a tiled primary memory system design in which none of the functions are centralized and thus can be easily apportioned into multiple tiles. In particular, I will focus on the design of load/store queues, dependence predictors, and memory consistency mechanisms for a tiled microarchitecture; I will also discuss how some of the aspects of the on-chip interconnection network, an emerging feature, can be used to enhance the memory system. I will show how the complete distribution of the memory system permits the design of a scalable primary memory system and enables microarchitectures with new capabilities. While performance, power and area have largely been the driving force behind architectures, security is also becoming increasingly important. The ongoing architectural rethinking in the context of Chip Multiprocessors provides a unique opportunity to integrate the support required to build trusted systems into hardware. Recently several full system efforts have been proposed to provide policy enforcement and protect critical secrets assuming a trusted hardware environment and a trusted software module. However, to extend the security envelope even further it is desirable to protect the trusted hardware and software components. The problem domain represents both challenges (e.g., insider attacks, the cost of implementing the security mechanisms) and opportunities (e.g., new security features that can also be used to improve concurrency). 
In this talk, I will describe some examples of insider attacks on the trusted components and present ongoing research aimed at detecting attacks on the trusted components. Time permitting, I will describe ongoing research on a collaborative method for parallelizing and optimizing legacy applications. Speaker biography: Simha Sethumadhavan is an Assistant Professor of Computer Sciences at Columbia University. He received his PhD from UT-Austin in 2008. Simha’s current research focus is on mitigating two of the big threats to continued computer improvements: security and concurrency. | ||||
| Maintaining Performance Scalability in the Petaflop Era and Beyond with Hybrid Bulk-Streaming/Threading Architectures and Programs | ||||
| Mattan Erez | On: | 13-Feb-2009 11:00 AM - 12:30 PM | ||
| University of Texas (Austin) | At: | Austin Research Lab, Room | ||
Abstract: While Moore's Law is still going strong, device and circuit designers cannot maintain traditional CMOS VLSI scaling trends. As a result, architects, programmers, and algorithm developers must work together to increase efficiency and continue to deliver better and faster solutions. This talk will focus on the architecture perspective and implications on programming and algorithms with emphasis on future designs and potential long-term performance portability. I will describe the characteristics of modern VLSI processes and future projections and introduce architectural techniques, based on fundamental principles of locality, parallelism, and hierarchical control (LPH), to improve performance and efficiency. I will also discuss current trends in processor architecture and argue why I believe we are converging to two extremes in execution models: threading and bulk-streaming. Based on this observation I will advocate a hybrid bulk/threading architecture model, and explain how it can lead to scalable processors, systems, and programming models while simplifying writing, compiling, and maintaining performance-portable codes. I will present initial results on mechanisms to bridge the gap between bulk-streaming and threading with respect to dynamic and irregular applications. I will also briefly describe other ongoing research in my group aimed at maintaining performance scalability, including research on hybrid on-chip interconnects and techniques to reduce the cost and performance impact of strong error-correction for future fault-tolerant memory systems. | ||||
| Hardbound: Hardware for Making C as Secure and Spatially Safe as Java | ||||
| Milo Martin | On: | 5-Dec-2008 10:00 AM - 11:30 AM | ||
| Assistant Professor | At: | Watson Research Center (Yorktown), Room 20-001 | ||
| University of Pennsylvania | ||||
Abstract: The C programming language is at least as well known for its absence of spatial memory safety guarantees (i.e., lack of bounds checking) as it is for its high performance. C's unchecked pointer arithmetic and array indexing allow simple programming mistakes to lead to erroneous executions, silent data corruption, and security vulnerabilities. Many prior proposals have tackled enforcing spatial safety in C programs by checking pointer and array accesses. However, existing software-only proposals have significant drawbacks that may prevent wide adoption, including: unacceptably high runtime overheads, lack of completeness, incompatible pointer representations, or need for non-trivial changes to existing C source code and compiler infrastructure. Inspired by the promise of these software-only approaches, this work proposes a hardware bounded pointer architectural primitive that supports cooperative hardware/software enforcement of spatial memory safety for C programs. This bounded pointer is a new hardware primitive datatype for pointers that leaves the standard C pointer representation intact, but augments it with bounds information maintained separately and invisibly by the hardware. The bounds are initialized by the software, and they are then propagated and enforced transparently by the hardware, which automatically checks a pointer's bounds before it is dereferenced. One mode of use requires instrumenting only malloc(), which enables enforcement of per-allocation spatial safety for heap-allocated objects for existing binaries. When combined with simple intra-procedural compiler instrumentation, hardware bounded pointers enable a low-overhead approach for enforcing complete spatial memory safety in unmodified C programs. Speaker biography: Milo Martin is an Assistant Professor in the Computer and Information Science Department at the University of Pennsylvania. His research focuses on making computers easier to design, verify, and program. 
Specific projects include adaptive cache coherence protocols, multiprocessor design verification, transactional memory to assist in creating multithreaded programs, hardware-aware verification of concurrent software, and hardware to support efficient implementations of C that are as safe and secure as memory-safe languages such as Java. Dr. Martin is a recipient of the NSF CAREER award and received a PhD from the University of Wisconsin-Madison in 2003. | ||||
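The bounded-pointer idea in the Hardbound abstract can be modeled in software to show the division of labor: software sets bounds at allocation, pointer arithmetic carries them along unchanged, and every dereference is checked. The sketch below is only a conceptual model under those assumptions, not the Hardbound hardware design; the class names, the bump allocator, and the byte-array memory are all hypothetical.

```python
from dataclasses import dataclass


class BoundsError(Exception):
    """Raised when a pointer is dereferenced outside its allocation."""


@dataclass(frozen=True)
class BoundedPointer:
    addr: int    # the ordinary pointer value
    base: int    # first valid address of the allocation
    bound: int   # one past the last valid address

    def offset(self, delta: int) -> "BoundedPointer":
        """Pointer arithmetic: the address moves, the bounds propagate unchanged."""
        return BoundedPointer(self.addr + delta, self.base, self.bound)


class Memory:
    def __init__(self, size: int):
        self.data = bytearray(size)
        self.next_free = 0

    def malloc(self, size: int) -> BoundedPointer:
        """Instrumented allocator: the only place where bounds are initialized."""
        base = self.next_free
        if base + size > len(self.data):
            raise MemoryError("out of simulated memory")
        self.next_free += size
        return BoundedPointer(base, base, base + size)

    @staticmethod
    def _check(ptr: BoundedPointer) -> None:
        if not (ptr.base <= ptr.addr < ptr.bound):
            raise BoundsError(f"address {ptr.addr} outside [{ptr.base}, {ptr.bound})")

    def load(self, ptr: BoundedPointer) -> int:
        self._check(ptr)             # the check happens before every dereference
        return self.data[ptr.addr]

    def store(self, ptr: BoundedPointer, value: int) -> None:
        self._check(ptr)
        self.data[ptr.addr] = value
```

A short usage example: `p = Memory(64).malloc(8)` yields a pointer for which `p.offset(7)` can be loaded and stored, while dereferencing `p.offset(8)` raises `BoundsError`. Note that forming the out-of-bounds pointer is allowed; only the dereference is rejected.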
