By developing innovative technology to connect individual computers
together into clusters,
IBM researchers are providing
new ways to create scalable,
high-availability systems.
In Brief:
A cluster consists of multiple computers linked together and managed in such a way that makes the cluster fault-tolerant, or highly available, while appearing to users as though it were a single computer. Jointly with the S/390 Division, IBM researchers began developing clustering technology for the IBM System/390® Parallel Sysplex®. Research has also helped develop clustering for AS/400® machines and has played a significant role in developing clustering concepts for IBM's RS/6000(TM); Scalable POWER-parallel Systems® (SPTM). Watson researchers also played a role in developing the RS/6000 Cluster Technology, a generic cluster management software that IBM hopes to develop as the industry standard for UNIX® clusters.
There was a time, midway through the 1980s, when scientists thought that the way to build the fastest, most efficient computer was by creating the fastest, most efficient processors. That idea came to be challenged, however, by an approach based on clustering. The concept has grown since its inception to mean many different things to different people, but at its heart it is quite simple. "It means putting machines together so that they appear to the user as a single machine," says Alain Azagury, a scientist at IBM's Haifa Research Laboratory.
Clusters are created with two goals in mind. The first is horizontal growth, or scalability. "In order to obtain more processing power you can add one or more machines instead of replacing your existing machine," explains Julian Satran, another researcher at Haifa. The second aim is high availability. "When one of the machines in a cluster fails," adds Satran, "the failure is detected automatically, and applications running on the failed node recover on another machine in the cluster."
Cost is also an important motivation behind clustering, points out Tushar Chandra, manager of the distributed and cluster systems group at Watson. "Forming a cluster out of a bunch of commodity processors can be a lot less expensive way to obtain higher performance than building a faster, dedicated processor."
The clustering concept grew naturally out of scientific computing. When researchers thought about how to solve the most complex computational problems the world had to offer - modeling global weather, for instance - enlisting multiple processors was an obvious way to go about it. If one processor isn't enough to solve the problem, more might help. "Because scientific computations naturally decompose themselves into smaller problems, clustering is an especially good approach," says Ajei Gopal, senior manager of the distributed systems and services group at IBM's Thomas J. Watson Research Center. "For example, certain common scientific calculations can always be divided up into a number of smaller problems whose solutions can then be combined together to give the right answer."
Clusters are also ideal for Web servers. "If you're using your computer for batch processing, for figuring salaries at the end of the month, you can afford some down time," says Marc Snir, senior manager of Watson's parallel systems group. "If your Web server is accessed interactively from everywhere in the world, it has to be up all the time."
The Parallel Cluster
IBM's SP product line is based squarely on the clustering concept. Each SP machine is a full-fledged computer that consists of a cluster of RS/6000 nodes. With the debut of the SP, IBM researchers developed a series of services that would make the cluster look like a single computer to the user and the application. For instance, the Network Dispatcher monitors the load on the processors in the cluster and makes decisions about which RS/6000 node or nodes to route data and work. The responses from those nodes then bypass the dispatcher and go back directly to the user. "As far as you're concerned," says Dan Dias, manager of the parallel commercial systems group at Watson, "you see the cluster as a single machine, when in reality it's a bunch of machines working together. The Network Dispatcher makes it transparent to you."
Along the same lines, IBM researchers developed ways to share the data so that every node can access any necessary data in the cluster. In the S/390 Sysplex, it is done through physically sharing disks. For the SP, the idea was to create what's known as a virtual shared disk. "If each processor can access any data," says Dias, "then each can do a bit of the work and you can get a huge speed-up. The problem with most disks is they're typically connected to only one computer. The virtual shared disk is software that makes it look as if each disk has a connection to all computers."
To make the virtual shared disk fault-tolerant, each physical disk in the system is connected to two separate processors. If either fails, says Chandra, "the other could continue to export the data. Each of the two computers can be viewed as a gateway to the disk. As long as one is up, the other computers have access to that disk. It's much less expensive than connecting every node to the disk drive, and yet the performance is practically the same."
From Proprietary To Generic
The traditional approach to clustering relied on dedicated hardware and proprietary software to run it, and in some instances there are advantages to this approach (see "Clustering for AS/400"). Last year, however, IBM announced a generic clustering solution, called RS/6000 Cluster Technology (RS/6000 CT). "The motivation behind the technology was to create a low-cost clustering infrastructure that would run, not just on the SP, but on any kind of hardware," says Chandra. "The entire architecture assumes commodity components. It doesn't care which computer you're running it on or which kind of network." Indeed, IBM is now pushing RS/6000 CT as the industry standard management infrastructure for UNIX clusters. "It was designed to run anywhere and we would like to run it everywhere," says Chandra.
The first prototypes of RS/6000 CT were tested in 1994, and the product was first shipped in October 1996. It was developed by Gopal's group in the distributed systems and services department at Watson in a close, ongoing collaboration with the SP development organization at Poughkeepsie. "We continue to work together to ensure that the product is able to meet all the customer requirements encountered in the marketplace," explains Chandra. "It's an outstanding example of a partnership. I feel we owe our success to them, and they have expressed a similar feeling."
Because a cluster is a group of individual machines that one wants to be able to view as a single machine, any cluster needs management tools that allow the system administrator to monitor resource utilization, and do load balancing and system management. RS/6000 CT provides the core set of services to do that, and, in doing so, it also detects the failure of components and orchestrates the execution of recovery procedures.
RS/6000 Cluster Technology addresses two different classes of applications: those that are cluster enabled and those that are not. Cluster-enabled applications exploit RS/6000 CT to hide failures seamlessly. For example, cluster-enabled applications typically run simultaneously on many machines in the cluster. When one machine fails, the work being done by that machine is transferred seamlessly to another machine in the cluster.
Remarkably, RS/6000 CT also provides a more limited, although useful, level of fault-tolerance for applications that were not written with clustering in mind. When the primary machine fails, RS/6000 CT will automatically bring up a copy of the application on a backup. During recovery, which typically takes from 0 seconds to a few minutes, the application is not available, but after recovery it is again available. Since a short recovery period is acceptable in most usage scenarios, that solution is usually adequate for legacy and other non-cluster-enabled applications.
Going The Distance
The future of RS/6000 CT has three facets. For starters, the RS/6000 CT team is looking for ways to improve and expand the existing clustering infrastructure - for instance, creating services to help applications run on scalable clusters and manage scalability.
Second, IBM's Software Solutions Division (SSD) is working closely with different platform groups within IBM, as well as with other hardware vendors, to exploit the benefits of RS/6000 CT on a variety of platforms. In addition, SSD will work on packaging the RS/6000 CT infrastructure and other elements into a set of portable cross-platform clustering services that can be bundled with middleware such as Lotus Notes and DB/2.
Indeed, while much of the discussion about RS/6000 CT has been on clustering of UNIX workstations, RS/6000 CT has also been demonstrated on Windows NT®. At last year's PC Expo, Gopal points out, IBM demonstrated RS/6000 CT on a four-way PC cluster, whereas Microsoft's Cluster Server product (formerly known as "Wolfpack") was running only on two-way clusters.
The third aspect of RS/6000 CT's future involves the Internet which is, after all, nothing more than a planetwide, albeit very loose, cluster of computers. "We are looking at connections among computers that would have reason to collaborate over the Internet," says Chandra. "For example, say I charge an airline ticket on my IBM American Express card using the Internet. That transaction involves three companies that need to communicate with each other: IBM, American Express and the airline. To build the infrastructure to facilitate such applications, we will use the kinds of techniques that were developed as part of our clustering technology."
Gary Taubes is a freelance writer based in Boston. His latest book is Bad Science: The Short Life and Weird Times of Cold Fusion.
More Information:
Clustering for AS/400
Theory Behind Performance
Clustering for AS/400
Clustering offers two advantages: scalability and reliability. "With clustering, whenever you need more computing power, more storage space, more terminals, you simply add another machine," says Julian Satran, Senior Technical Staff Member at the Haifa Research Laboratory.
Scalability was a criterion for a clustering project started in the late 1980s by IBM's AS/400 Division, in Rochester, Minnesota. The objective was to share data transparently among two or more AS/400 machines, thereby enabling a cluster to grow without the need to reconfigure the setup every time a new machine is added. A Haifa team led by Satran developed the concept of "shared virtual storage," in which storage is spread across all the machines in a cluster and backed up by disks.
Next, the Haifa group tackled the problem of high availability. The result: a concept called ASP (for auxiliary pod storage), which involves laying memory down differently in different machines. That permits a storage disk and its contents to move from one machine to another, ensuring that data is not lost even temporarily if a machine goes down.
Meanwhile, Alain Azagury, group manager of distributed systems at Haifa, worked with the AS/400 team on remote journaling - the transfer between machines in a cluster of operations, such as bank transactions, that must not be lost under any circumstances. His scheme involves sending the journal entries in such a way that the second machine acknowledges receiving the data before the first machine can carry out another transaction. An extension of this work which devised a method of transferring data to backup machines if the primary machine fails - even if the failure occurs in the middle of a transaction - helped lead to a prototype version of the Highly Available AS/400 cluster. The Haifa work has been incorporated in Opticonnect/400, a product designed to facilitate the build-up of AS/400 clusters, and more of Haifa's technologies will be incorporated in the near future in products.
The Theory Behind Performance
"Performance is a key issue in all of this," says Mark Squillante, manager of the systems
design, analysis and theory group at IBM's Thomas J. Watson Research Center. Squillante has been working on the theoretical analysis and optimization of scheduling and resource management for clusters. As he says, the goal is to optimize cluster performance and incorporate the solutions in real systems.
"But there are different performance
objectives," Squillante points out. "For example, we may want to minimize response time, maximize throughput or realize some combination of objectives. This makes cluster scheduling problems even more complex, as there is never one single objective."
In the case of the simple goal of balancing a workload across the nodes of a cluster to prevent any one node from becoming a
bottleneck, the scheduler might opt for what's called affinity scheduling. That routes jobs to the nodes in the cluster that are best suited to handle them. It might happen, however, that every node in the cluster could tackle the particular application but that one node already has the necessary data loaded in its memory.
"Since that node doesn't have to access the data from the disk, it would perform the task faster," explains Squillante. "Yet, it's not always that simple, because that node might be overloaded while others are underloaded. Hence, there's a fundamental trade-off between affinity scheduling and load balancing."
The situation becomes further complicated with parallel applications, which introduce a new set of considerations. For example, how many nodes should be allocated to any single job and how should they be allocated? Do all the allocated nodes execute the job simultaneously, or do tasks of the job independently queue up for nodes that will execute the tasks when they can get around to it?
"It depends heavily on the application environment," says Squillante. "We have an optimal policy for one class of applications, and we've been working on finding the optimal degree of space-sharing and time-sharing given the characteristics of another class of applications and the system load."
Squillante and some of his experimental colleagues in IBM Research, such as Liana Fong, Nayeem Islam and Pratap Pattnaik, have come up with scheduling approaches that are theoretically optimal and perform extremely well in practice. One specific approach to "gang scheduling" takes advantage of dynamic space- and time-sharing, so that the allocation of the workload can change with the circumstances. For instance, the number of nodes allocated to a job could be adjusted to maintain a certain level of node efficiency, while the degree of time-sharing could be adjusted to achieve some form of shortest service time first. "You can often prove that such adaptive combinations of space-sharing and time-sharing are required to achieve the best performance for certain application environments," says Squillante.
While many aspects of the work are still at the prototype stage, the analysis and experiments suggest that some of these optimal scheduling methods could provide as much as two- to three-fold improvements in performance.