
eWLM: Enterprise Workload Management
By Vic Chase
When that annoying window pops up on your computer monitor advising you that there is "not enough memory available," it may not be telling the truth. And the message that interrupts your Web browsing to tell you "the server may be busy" is not entirely accurate either. Insufficient hardware capacity frequently catches the blame for these and other banes of the computer user's existence—but the true culprit is more likely the inefficient allocation of resources. Your computer may not be running out of memory; it simply may not have enough devoted to the application at hand. Similarly, while the server you reached during a Web surfing session may be overloaded, chances are another one on the same network is sitting idly by.
In the world of large-scale commercial computing, such inefficiencies have more profound ramifications, which are magnified as the complexity of computer hardware systems increases. To rectify these problems, IBM developed workload management to optimize the use of computer hardware. Workload management (WLM) is intelligent software that monitors and adjusts the allocation of resources such as the central processing unit (CPU), memory, and input/output (I/O) flow on an ongoing, split-second basis.
Spiraling circumstances
When mainframe computers were introduced into the business world during the 1960s, the tasks they performed, such as cutting payroll checks and keeping basic databases, were relatively simple. Scheduling computer tasks was an easy process, controlled by the order in which punch cards were fed into a computer.
But then came word processing, complex database management, computer graphic capabilities, e-mail, and the Web, all of which created competition for computer time. Consequently, companies purchased more hardware and software to respond to these needs and hired more people to allocate the computer resources. These system administrators had to make decisions as to which users received priority for particular tasks.
"The complexity of the large systems increased to the point that it was very difficult for a group of individuals to manage the hardware and operating systems effectively," says Jeff Aman, an IBM distinguished engineer with the server group in Poughkeepsie, N.Y.
Programmed to learn
IBM researchers at the Watson Research Center in Yorktown Heights, N.Y., along with the product development group in Poughkeepsie, anticipated this problem 20 years ago. They set about developing a WLM system for mainframe computers that could learn on its own and respond to changing conditions in seconds.
"Building such intelligent software is like creating artificial intelligence," says Donna Dillenberger, a senior technical staff member at the Watson Research Center, who has been involved in WLM development for 10 years. "It is programmed to learn about the delays that occur, and to figure out what resources it has to allocate to reduce response times." As a result, she explains, "An administrator doesn't have to say, 'Let's set all these configuration knobs.' Instead, all he or she has to say is, 'I would like all the Web requests to send a response back within four seconds,' and learning algorithms determine the best configuration to satisfy those goals." An algorithm can "learn" and change its behavior by comparing the results of its actions with the goals that it is designed to achieve.
This is accomplished through the use of what is known in the software business as "evolutionary computing learning heuristics." An adaptive system is one that learns on its own, using a variety of feedback mechanisms, rather than by rote. Such a program learns from the user, in much the same manner as speech recognition software or even the spell check feature of a word processing program.
In essence, says Dillenberger, "The program gathers data about its environment, using real-time plots to determine what has changed in the environment. It then figures out the next steps it has to take to change the environment to reach a desired goal." For example, WLM decides which resources (system, network, load-balancing patterns) to adjust in order to enable a system to meet end-to-end performance goals.
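The goal-driven feedback Dillenberger describes can be illustrated with a deliberately simple stand-in. The Python sketch below is not IBM's learning algorithm; it swaps in plain proportional feedback, and the function name, the gain, and the share limits are all assumptions made for illustration. It compares a measured response time with the administrator's goal and nudges a resource share accordingly.

```python
def adjust_share(share, measured_s, goal_s, gain=0.5, lo=0.05, hi=1.0):
    """Return a new resource share nudged toward the response-time goal.

    share      -- current fraction of a resource given to the work (0..1)
    measured_s -- observed response time, in seconds
    goal_s     -- administrator's response-time goal, in seconds
    gain, lo, hi are illustrative tuning values, not WLM parameters.
    """
    # Positive when the work is slower than its goal, negative when faster.
    relative_error = (measured_s - goal_s) / goal_s
    # Grow the share when the goal is missed, shrink it when there is headroom.
    proposed = share * (1.0 + gain * relative_error)
    # Keep the result inside sensible bounds.
    return max(lo, min(hi, proposed))


# Hypothetical run: Web requests take 6 seconds against a 4-second goal.
share = adjust_share(0.4, measured_s=6.0, goal_s=4.0)
print(f"new share: {share:.2f}")
```

Calling the function repeatedly with fresh measurements gives the basic shape of the loop: observe, compare with the goal, adjust, and observe again.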
Setting rules
To enable WLM to learn on its own, an administrator first feeds a set of rules into the system via a graphical user interface (GUI), a user-friendly part of the software through which the information is entered. The rules allow a business to specify different qualities of service for various customer segments by defining a number of service classes. Each class includes business importance and performance goals.
Once these classification rules are defined, WLM automatically runs them on all incoming work, explains Steve Heisig, a WLM senior programmer at the Watson Research Center. Given the parameters WLM receives from the administrator, it assigns a service class to the work and processes it accordingly.
Using this system, a financial company's administrator, for example, can instruct the WLM feature to admit only account-holding customers to its Web site when its servers become overloaded, while holding "just browsing" individuals at bay until the server load lightens. Or it can choose to provide a preferred customer segment with a three-second response time, while another segment may get a five-second response time. In this case, the customer segment is the business importance portion of the equation, while the specified response time is the performance goal.
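To make the idea of service classes and classification rules concrete, here is a minimal Python sketch of the financial company example. Everything in it is hypothetical: the class names, the request fields ("segment", "account_holder"), the importance numbering, and the five-second goal given to browsing traffic are assumptions for illustration, not WLM's actual policy format.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ServiceClass:
    name: str
    importance: int          # 1 = most important (assumed convention)
    response_goal_s: float   # performance goal: response time in seconds


@dataclass(frozen=True)
class Rule:
    matches: Callable[[dict], bool]   # test applied to an incoming request
    service_class: ServiceClass       # class assigned when the test passes


PREFERRED = ServiceClass("preferred", importance=1, response_goal_s=3.0)
STANDARD = ServiceClass("standard", importance=2, response_goal_s=5.0)
BROWSING = ServiceClass("browsing", importance=3, response_goal_s=5.0)

# Rules are checked in order; the first match wins.
RULES = [
    Rule(lambda req: req.get("segment") == "preferred", PREFERRED),
    Rule(lambda req: req.get("account_holder", False), STANDARD),
]


def classify(request: dict) -> ServiceClass:
    """Assign a service class to incoming work; default to browsing."""
    for rule in RULES:
        if rule.matches(request):
            return rule.service_class
    return BROWSING


def admit(request: dict, overloaded: bool) -> bool:
    """When servers are overloaded, hold 'just browsing' work at bay."""
    return not (overloaded and classify(request) is BROWSING)
```

With these rules, a request such as `{"account_holder": True}` is admitted even under overload, while an anonymous request waits until the load lightens.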
Quarter-second sampler
When you consider that hundreds of users running hundreds of applications simultaneously on a single mainframe can call upon the WLM software to make thousands of decisions, the magnitude of WLM's task becomes apparent.
To make sure things are kept on track, WLM includes a "sampler," which takes samples of the processes running on the system every quarter second and figures out if work is delayed by any of the resources that it manages, explains Heisig. Then, every 10 seconds, WLM runs policy adjustment code that compares the sampler statistics against the goals of each service class to determine whether those goals are being met. If not, it looks for the source of the delay and addresses it by adjusting or reallocating resources. As this takes place, the intelligent software makes instantaneous decisions as to who wins and who loses CPU capacity when the inevitable trade-offs are made.
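The two timescales can be sketched in a few lines of Python. The sketch below is an assumption-laden illustration, not WLM's code: the process records (with hypothetical "service_class" and "waiting_on" fields) and the function names are invented, and the snapshots are passed in as a list so the example runs without real timers. Only the quarter-second and 10-second intervals come from the article.

```python
from collections import Counter

SAMPLE_INTERVAL_S = 0.25   # sampler period
ADJUST_INTERVAL_S = 10.0   # policy adjustment period
SAMPLES_PER_ADJUSTMENT = int(ADJUST_INTERVAL_S / SAMPLE_INTERVAL_S)  # 40


def take_sample(processes):
    """Return (service_class, resource) for every process that is delayed."""
    return [
        (p["service_class"], p["waiting_on"])
        for p in processes
        if p["waiting_on"] is not None
    ]


def adjust_policy(delay_counts, goals_met):
    """For each class missing its goal, name the resource delaying it most."""
    actions = {}
    for service_class, met in goals_met.items():
        if met:
            continue
        per_resource = Counter(
            {res: n for (cls, res), n in delay_counts.items() if cls == service_class}
        )
        if per_resource:
            actions[service_class] = per_resource.most_common(1)[0][0]
    return actions


def run_interval(snapshots, goals_met):
    """Accumulate one adjustment interval of samples, then adjust policy."""
    delay_counts = Counter()
    for processes in snapshots:
        delay_counts.update(take_sample(processes))
    return adjust_policy(delay_counts, goals_met)
```

In a live system, `snapshots` would hold the 40 quarter-second samples gathered since the previous adjustment, and the returned dictionary would tell the resource manager where to give each lagging class more capacity.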
Using its built-in intelligence, WLM may, for example, usurp some CPU time from a computer running a payroll overnight, to respond immediately to a person at a keyboard requesting information, even though the payroll is more important than the user's request. This is possible because the WLM software understands that it has all night to run the payroll, and that taking half a second of the CPU's time to respond to the user will not delay the goal of processing the payroll by morning.
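The reasoning behind that trade-off amounts to a slack calculation. The short Python sketch below is an illustrative simplification with hypothetical names and numbers: it asks whether a batch job with a deadline can lend some CPU time and still finish on schedule.

```python
def can_borrow_cpu(remaining_work_s, time_until_deadline_s, borrowed_s):
    """True if the batch job still meets its deadline after lending CPU time.

    remaining_work_s      -- CPU seconds the job still needs (assumed known)
    time_until_deadline_s -- seconds left before the job must be finished
    borrowed_s            -- CPU seconds to hand to the interactive request
    """
    slack_s = time_until_deadline_s - remaining_work_s
    return borrowed_s <= slack_s


# Hypothetical payroll run: four hours of work left, eight hours until morning.
print(can_borrow_cpu(4 * 3600, 8 * 3600, borrowed_s=0.5))
```

A job with hours of slack can give up half a second without consequence; one that is already running up against its deadline cannot.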
Homestead farming
Because of the complexity of WLM software, the first system took 10 years to develop and was eventually introduced in 1992 as part of IBM's OS/390 operating system, the system software for its mainframe computers.
Yet the creation of WLM is an ongoing, iterative process. The most recent challenge for developers lies in adapting WLM to the transition from large mainframe computers to distributed computing, the use of dispersed groups of networked servers, sometimes referred to as "server farms" or "homesteads."
"When a customer makes a request on a browser or at an ATM machine, it may go to a Web server, and the server may initiate work in a back-end database server, or it may initiate work in a transaction manager," explains Dillenberger. But despite the fact that these networked servers can be located at opposite ends of the Earth, the goal is to make the transaction smooth and transparent to the end user.
To deal with this heterogeneous mix, IBM is currently preparing an updated WLM system—known as heterogeneous, or enterprise workload management—to provide the same optimization to networked machines that WLM offers mainframes.
But doing so is far from a simple task. "The workload management problem has changed from, 'How do I keep this large mainframe alive and running well?' to 'How do I coordinate the activities of a large number of heterogeneous servers to meet my overall business objectives?'" says Aman.
Adding to the intricacy of the developers' task is the fact that in distributed computing configurations, many of the networked servers run different operating systems, including IBM's z/OS (formerly OS/390), Unix™, Linux®, and Windows®.
Nonetheless, says Dillenberger, "When you have distributed requests that must be coordinated from end-to-end and finish a transaction within four seconds, these WLM algorithms will cooperate with each other to make sure that the total time is in fact four seconds."
This is accomplished using essentially the same rules classification process as the mainframe WLM system. "In this case, however, you have a server in the configuration that plays a management role," says Aman. At the same time, he notes, "We expect the individual servers to automate the management of their own performance in the context of the end-to-end distributed flows, which can be a very difficult problem."
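One way to picture the management role is as a server that adds up the time each hop contributes to a transaction and checks the total against the end-to-end goal. The Python sketch below assumes exactly that and nothing more: the hop names, the reporting format, and the function itself are hypothetical, and it expects at least one hop to have reported.

```python
def end_to_end_report(hop_times_s, goal_s):
    """Summarize one transaction's path against its end-to-end goal.

    hop_times_s -- mapping of server name to seconds spent there (non-empty)
    goal_s      -- end-to-end response-time goal in seconds
    """
    total_s = sum(hop_times_s.values())
    # The hop contributing the most time is the first place to look.
    slowest_hop = max(hop_times_s, key=hop_times_s.get)
    return {
        "total_s": total_s,
        "goal_met": total_s <= goal_s,
        "slowest_hop": slowest_hop,
    }


# Hypothetical transaction crossing three servers with a four-second goal.
report = end_to_end_report({"web": 0.5, "txn": 1.0, "db": 3.0}, goal_s=4.0)
print(report)
```

Here the transaction misses its goal by half a second, and the database server stands out as the hop where the individual servers' own tuning would need to recover the time.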
While IBM prepares to beta test an eWLM product with several large customers later this year, one may reasonably question the need. After all, don't most Web, ATM, and other computer transactions already take place quickly enough? The answer is yes, but at considerable cost.
"Companies are buying thousands of computers just to be able to give their customers remote access to the merchandise or data that they would like to buy," says Dillenberger. Companies implementing eWLM will be able to reduce the amount of time their administrators spend to locate performance bottlenecks and optimize workloads. Their current hardware utilization will increase as eWLM optimally changes resource settings across their existing hardware.
Autonomic and Grid Computing
Workload management will also play a key role in developing autonomic computing. One of IBM's most important project areas, autonomic computing derives its name from the autonomic nervous system, which regulates many functions of the human body, such as heart rate, breathing, perspiration, and the digestive process. Autonomic computing seeks to bring automated, self-monitoring harmony to complex computer systems. One of the attributes of an autonomic computing system is self-optimization. WLM provides just this sort of self-tuning.
Another cutting-edge concept ripe for the versatility of WLM is grid computing. Currently gaining popularity in university and scientific circles, grid computing is a process by which individual institutions use open protocols to share access to computing resources over the Internet or over private networks: everything from supercomputers to data, files, applications, and storage. Essentially, the grid can be compared to a utility, but instead of purchasing electricity or natural gas, one purchases computing time. Sharing these computing resources effectively will require self-optimizing middleware like WLM.
Workload management would automatically direct the flow of work through the grid much as the flow of electricity is directed through transmission lines, albeit on a much more complex basis.
"As capacity is required, WLM would allow it to be dynamically obtained," explains Aman. "Essentially, it would be a self-managing autonomic environment that can dynamically grab capacity off the grid."
IBM is currently working on grid projects in conjunction with a number of universities and other labs. One of them involves creating a national repository of digital mammographic data.
Irving Wladawsky-Berger, VP, Technology and Strategy, IBM Server Group, outlines the advantages of storing this information digitally. "It becomes accessible over the network to people who have the proper security credentials," he says. "Most important, there will be a whole set of advanced applications developed that can compare different mammograms in time to detect features that you would probably miss if viewing the mammogram at just one point in time. If multiple physicians, perhaps scattered across the country, need to consult and view a particular mammogram, this grid will give them access to the information as if they were in the same room."
A Commerce Grid
In the not too distant future, according to Wladawsky-Berger, the grid will find its way into commercial use, following much the same path taken by the Internet. "The Internet came from the research community, from the university community, from the government labs, and then, over time, it moved into the wider commercial world," he says.
In fact, this process is beginning to evolve under the guise of "Web service applications." According to Wladawsky-Berger, these are "very much in the spirit of the grid initiative, in that they are focused on sharing resources over the Internet, especially applications."
He cites Galileo International, a travel agency portal for online reservations, as an example of a forerunner of the grid in commerce. "We are already doing work with companies like Galileo on opening up their reservations systems to allow travel agents to integrate their applications with those of Galileo in the simplest way possible," he says.
Eventually, the grid may serve as one big virtual computer for the entire digital world, encompassing both the scientific community and the realm of commerce, as well as the individual user. In this scenario, individuals will have client devices with which to connect to the grid, but the memory, storage and applications will reside on the Internet.
No longer will you need to worry about purchasing the most recent version of every application and checking regularly for updates. A service provider will perform these functions automatically, and the computing power at your disposal will be virtually limitless. And, of course, workload management will be there to facilitate the process.
In the final analysis, achieving the optimal utilization of increasingly powerful and complex computer systems—be they mainframe, distributed or grid—will be a formidable challenge. "This is a massive problem, and it's something that's going to evolve for a long period of time," says distinguished engineer Aman. "And it's the kind of thing that has to be cooperative. It's not an IBM-only solution—we need partnerships with vendors, with people who do distributed systems management today, and with those who create the applications that predominate in the distributed environment."
And with its universal applicability, WLM will eventually render those annoying "out of memory" messages a distant memory.