Distributed and Fault-Tolerant Computing


Distributed and Fault-Tolerant Computing (DFTC) addresses many of the core issues in computer science and is a critical area of research for IBM. Our researchers work at a host of research locations across the world and are interested in a wide range of topics. Our current focus is on addressing key challenges in Grid and Autonomic computing, and flexible integration of disparate IT and business processes for enabling on-demand services – all of which have a significant impact in transforming today’s businesses:

Integration -- creating business flexibility by integrating disparate, unconnected business and IT processes.
Automation -- reducing costs and increasing business responsiveness through IT and business linkage.
Virtualization -- improving working capital and asset utilization.

Links to a few representative projects are listed below along with a sampling of different aspects of our activities in the larger research community.

WSLA - Web Service Level Agreement for On Demand Services
WSLA is a novel framework for specifying and monitoring Service Level Agreements (SLAs) for Web Services. The framework includes tooling for creating SLAs online, deploying them dynamically, measuring and monitoring QoS parameters to check the agreed-upon service levels, and reporting violations to the authorized parties involved in the SLA management process.
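
To make the monitoring step concrete, here is a minimal Python sketch that compares measured QoS values against agreed thresholds and collects the violations to report. It is an illustration only, not the WSLA framework's API or its SLA language: the names ServiceLevelObjective and find_violations, and the example metrics, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevelObjective:
    """One agreed-upon service level (hypothetical structure, not WSLA syntax)."""
    metric: str                     # e.g. "response_time_ms"
    threshold: float                # agreed limit for the metric
    higher_is_better: bool = False  # True for metrics such as availability


def find_violations(objectives, measurements):
    """Return (metric, measured, threshold) for every objective that is not met.

    Metrics with no measurement are skipped rather than treated as violations.
    """
    violations = []
    for slo in objectives:
        if slo.metric not in measurements:
            continue
        value = measurements[slo.metric]
        if slo.higher_is_better:
            met = value >= slo.threshold
        else:
            met = value <= slo.threshold
        if not met:
            violations.append((slo.metric, value, slo.threshold))
    return violations


# Example values are made up for illustration.
objectives = [
    ServiceLevelObjective("response_time_ms", 500.0),
    ServiceLevelObjective("availability_pct", 99.5, higher_is_better=True),
]
measured = {"response_time_ms": 620.0, "availability_pct": 99.9}
violations = find_violations(objectives, measured)
```

In this example only the response-time objective is violated; a real deployment would forward such violations to the parties named in the SLA.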


WSMM - Web Services Management Middleware for On Demand Services
WSMM provides fundamental support for differentiated web services based on Service Level Agreements (SLAs). This enables service providers to offer the same web service, on demand, at different performance levels (e.g., response time thresholds and throughput limits). WSMM dynamically allocates resources to web service requests, with the goal of optimizing the system utility. WSMM and WSLA technologies are available for download as part of the Emerging Technologies Toolkit.
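
The idea of allocating resources across service classes to optimize utility can be illustrated with a small Python sketch. The text above does not say how WSMM performs the allocation, so this uses a simple greedy heuristic as a stand-in and assumes each class earns a fixed utility per unit of capacity, up to its demand. The function allocate and the class names and numbers are hypothetical.

```python
def allocate(capacity, classes):
    """Greedy allocation: serve classes in descending utility-per-unit order.

    classes maps a class name to (utility_per_unit, demand_units).
    Returns a dict of class name -> units granted.
    Assumption: utility is linear in granted units, capped at the demand.
    """
    remaining = capacity
    grants = {name: 0 for name in classes}
    ranked = sorted(classes.items(), key=lambda item: item[1][0], reverse=True)
    for name, (_utility, demand) in ranked:
        granted = min(demand, remaining)
        grants[name] = granted
        remaining -= granted
    return grants


# Three hypothetical service levels competing for 10 units of capacity.
classes = {"gold": (5.0, 6), "silver": (2.0, 8), "bronze": (1.0, 10)}
grants = allocate(10, classes)
total_utility = sum(classes[name][0] * units for name, units in grants.items())
```

Under the linear-utility assumption, the highest-value class is served first and lower classes receive whatever capacity is left.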


Web Services On Demand

Rainforest
The goal of Rainforest (a follow-on to the Océano project) is to enable policy-directed, dynamic, componentized autonomic provisioning and management. The primary objectives are to provide a framework for heterogeneity of resources and solutions; to autonomically create complex provisioning tasks and policies from basic elements; and to utilize existing provisioning products and management systems.

ReGS - Reporting Grid Service
ReGS is an OGSA-based Meta-OS Grid Service for logging, tracing, and monitoring applications in a distributed, heterogeneous computing environment, with extensible filtering capabilities. It exploits OGSI interfaces and provides standard logging interfaces for use by other Grid Services and applications by virtualizing existing logging systems, e.g., z/OS logging, NT events, and Unix syslogs.


Optimal Grid
The Optimal Grid middleware provides a grid-enabled collaboration framework, sophisticated management infrastructure, and problem-solving environment for grid computing. It is designed to help users harness the power of future grid utilities by hiding the complexities of partitioning, distributing, load balancing, and adapting a grid application to dynamic changes in available compute resources.


Gryphon
The Gryphon project focuses on advancing messaging middleware. It extends the scalable publish/subscribe framework to support efficient content-based routing in wide-area networks and stronger guarantees for message delivery by developing protocols tolerant to failures in the overlay network and in the clients.
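
As a rough illustration of content-based routing (as opposed to routing by topic name), the Python sketch below matches a message against subscriptions expressed as attribute constraints. It is not Gryphon's API or matching algorithm, and it ignores the wide-area and failure-tolerance aspects entirely. The functions matches and route, and the example subscriptions, are hypothetical.

```python
import operator

# Comparison operators a subscription constraint may use.
OPS = {"==": operator.eq, "<": operator.lt, ">": operator.gt,
       "<=": operator.le, ">=": operator.ge}


def matches(subscription, message):
    """True if every (attribute, op, value) constraint holds for the message."""
    for attribute, op, value in subscription:
        if attribute not in message:
            return False
        if not OPS[op](message[attribute], value):
            return False
    return True


def route(subscriptions, message):
    """Return the sorted subscriber names whose subscription matches the message."""
    return sorted(name for name, sub in subscriptions.items()
                  if matches(sub, message))


# Hypothetical subscriptions over message content.
subscriptions = {
    "alice": [("symbol", "==", "IBM"), ("price", ">", 100)],
    "bob": [("symbol", "==", "IBM")],
    "carol": [("volume", ">=", 1000)],
}
recipients = route(subscriptions, {"symbol": "IBM", "price": 95})
```

Here only bob receives the message: alice's price constraint fails and carol's subscription refers to an attribute the message does not carry.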


Policy-based Computing Systems
In policy-based computing, management operations are specified in terms of the objectives or goals that need to be met, rather than the detailed instructions that need to be executed. A number of activities have been undertaken to develop policy schemas and architectures in different domains, including: automated provisioning of computing systems, auditing of configuration constraints in Storage Area Networks (SANs), and guiding analysis and problem determination processes.


XMT - Policy-based Extended Web Services and Messaging Transactions
The XMT project addresses the main challenges in transactional coordination, including: defining transaction policies in an XML Web services/messaging-based environment, managing transaction policies and corresponding actions through an effective middleware system, and integrating such a transaction policy system with Web services and messaging-based transaction middleware.


DSF - Data Sharing Facility
DSF is an experimental project to build a serverless file system that distributes all aspects of file and storage management over cooperating machines interconnected by a fast switched network. DSF is aimed at scaling to hundreds of machines using commodity components. DSF provides a global memory cache, distributed file management, and a distributed storage repository.


SINTRA - Distributing Trust on the Internet
SINTRA (Secure INtrusion-Tolerant Replication Architecture) is a protocol suite for secure and fault-tolerant service replication in asynchronous networks such as the Internet. Using randomization, novel customized cryptographic tools, and optimistic methods, SINTRA provides the first practical protocols that do not rely on any timing assumption, while tolerating active coordinated attacks.

Related Publications  

Sumeer Bhola, Rob Strom, Saurav Bagchi, Y. Zhao and Josh Auerbach. Exactly Once Delivery in a Content-Based Publish-Subscribe System. Proc. International Conference on Dependable Systems and Networks, Washington, D.C., June 2002.

Melissa J. Buco, Rong N. Chang, Laura Zaihua Luan, Christopher Ward, Joel L. Wolf, Philip S. Yu, Tevfik Kosar and Syed Umair Shah. Managing eBusiness on Demand SLA Contracts in Business Terms Using the Cross-SLA Execution Manager SAM. ISADS 2003 - International Symposium on Autonomous Decentralized Systems. IEEE, November 2002.

Liana Fong, Michael Kalantar, Don Pazel, German Goldszmidt, K. Appleby, T. Eilam, S. Fakhouri and S. Krishnakumar. Dynamic Resource Management in an eUtility. Network Operations and Management Symposium. April 2002.

Jeffrey O. Kephart and David M. Chess. The Vision of Autonomic Computing. IEEE Computer 36(1), 2003.

Avraham Leff, James T. Rayfield and Daniel Dias. Meeting Service Level Agreements In a Commercial Grid. IEEE Internet Computing (Special issue on Grid Computing), July 2003.

Heiko Ludwig, Alexander Keller, Asit Dan, Richard King and Richard Franck. A Service Level Agreement Language for Dynamic Electronic Services. Electronic Commerce Research 3(1-2), April 2003.

Anton Riabov, Zhen Liu, Joel L. Wolf, Philip S. Yu and Li Zhang. New Algorithms for Content-Based Publication-Subscription Systems. ICDCS 2003 - The 23rd International Conference on Distributed Computing Systems. IEEE, December 2002.

Stefan Tai, Thomas A. Mikalsen, Isabelle Rouvellou and Stanley M. Sutton Jr. Conditional Messaging: Extending Reliable Messaging with Application Conditions. Proceedings of the 22nd IEEE International Conference on Distributed Computing Systems (ICDCS 2002, Vienna, Austria). IEEE, July 2002.

Recent accomplishments  


Alfred Spector received the IEEE Tsutomu Kanai Award on April 9, 2003, in recognition of his contributions in the area of distributed computing systems.

Alan Ganek was invited as a keynote speaker at IM 2003, March 2003.

German Goldszmidt was the Technical Program Committee Co-chair for IM 2003, March 2003.

Ron Levy, Jay Nagarajao, Giovanni Pacifici, Mike Spreitzer, Asser Tantawi, and Alaa Youssef received the Best Paper Award at IM 2003, March 2003, for “Performance Management for Cluster Based Web Services”.