Self-Aware distributed systems

Innovation Matters


This project aims at automating an increasingly complex and expensive task of real-time problem diagnosis in large-scale distributed systems by using state-of-art machine-learning, probabilistic reasoning and information-theoretic approaches. Achieving self-managing capabilities in computer systems is the central goal of autonomic computing, which is considered an inevitable next step in the evolution of today’s computing systems, given their incredibly rapid growth in size and complexity.


Figure 1. Evolution of Computing: why do we need Autonomic Computing

Figure 1. Evolution of computing: why do we need Autonomic Computing

As distributed computer systems and networks continue to grow in size and complexity, systems management tasks such as real-time fault localization and problem diagnosis become much more expensive and complicated, and call for higher levels of automation. Autonomic computing is a rapidly growing research area that aims at creating self-managing systems, and generates a variety of challenging applications for the AI field. Our project focuses on machine-learning approaches to developing Self-Aware Distributed Systems by using real-time Bayesian inference and learning combined with information-theoretic methods for optimal test selection, and develops novel cost-efficient, adaptive diagnostic algorithms. We also investigate robustness issues in large, complex networks and suggest algorithms for improving robustness in the presence of both random failures and coordinated attacks.

A key property of an autonomic computer system is “self-healing”. However, self-healing starts with self-awareness: the ability of the system to perform real-time inferences and learning about its own behavior, to diagnose and predict various faults and performance degradations. The main challenges here include:
1. scalability problems arising in extremely large distributed systems containing hundreds of thousands of components
2. cost-efficient information processing of huge data streams of various measurements where only a small fraction of the data might be relevant to the problem at hand (“drowning in data but starving for knowledge”)
3. adaptivity to a dynamically changing environment

We developed cost-efficient techniques that allow good trade-offs between diagnostic accuracy and cost of collecting evidence. Using a Bayesian network framework, we approach diagnosis as a probabilistic inference problem of recovering the most likely state of unobserved system components given the outcomes of 'probes', where probes are end-to-end transactions that test the availability and performance of a particular service (e.g., traceroute, web-, email- and database-access transactions, as well as e-business transactions). Figure 2 provides a simple example demonstrating our approach. Probe station 1 issues a probe (shown in yellow) testing the functionality of the web server X6 (e.g., opening a web page); the probe also goes through some other components such as routers X5, X2, X3 and X4. Another probe (shown in green) is issued by the probe station 2 and tests the database server, going through nodes X7, X4, X3 and X2. We can make inferences about the most likely node failures based on the outcomes of the probes: e.g., if both probes fail, and the probability of failure is equal at each node, the most likely diagnosis would blame the nodes common to both probes (X2, X3 or X4), otherwise the nodes on the failed probe which are not included in the OK probe must be blamed. In a noisy environment, there is a small probability of the probe outcomes being incorrect (e.g., probe failing even though its components are OK due to some unknown reason). We model the problem of probe-based diagnosis as finding a most-likely state of hidden variables in a Bayesian network, where hidden variables (upper layer in Figure 2) correspond to the component’s states, and observed variables (lower layer) correspond to probe outcomes.


Figure 2: End-to-end probes provide the input for a Bayesian inference engine that makes real-time inferences and outputs the most-likely diagnosis of the current system’s state.

Since it is important to minimize costs associated with probing, we explored several approaches to optimal probe selection, i.e. finding a minimal number of most-informative tests. Although the problem is NP-hard in general, simple greedy search based on information gain seems to work well in practice.

An important property that differentiates our approach from ''passive'' data analysis is its ability to perform adaptive online test selection and execution, called active probing, which allows significant reduction (up to 75% in most of our applications) in the number of probes and diagnosis time. Moreover, we developed an efficient approximate algorithm for most-informative test selection in general Bayesian networks with an arbitrary number of possible faults, where even the computation of info gain of a single probe may become intractable.

The key feature of our approach is its focus on achieving flexible and controllable trade-offs between multiple objectives, such as the accuracy of diagnosis versus its computational complexity and the cost of obtaining the information. We developed theoretical results characterizing such trade-offs (e.g., the number of probes necessary to achieve an asymptotically error-free diagnosis), as well as cost-efficient approximation techniques for handling intractable inference problems in large-scale diagnosis.

Our ongoing work includes extending the approach to distributed (collaborative) reasoning and learning that uses a “divide-and-conquer” approach to scalability in large-scale systems, and uses belief-propagation to combine local beliefs about the system's state into a more accurate 'global' picture of system's health.

Figure 3 shows an architecture of our diagnostic system called RAIL (Real-Time Active Inference and Learning), which uses the probe outcomes to make inferences about the system state, and actively requests the next most-informative probes to improve its diagnosis.
The learning component is currently under construction.


Figure 3: Architecture of the Real-Time Active Inference and Learning (RAIL) system: using a real-time event stream of various observations of the system’s behavior (such as end-to-end probe outcomes obtained by the sensing tool) and a Bayesian network dependency model, the RAIL system: (1) makes online inferences about the underlying faults and performance degradations of the system’s components that might be difficult or expensive to measure directly, (2) actively requests the most-informative tests to improve the diagnosis and (3) updates the dependency model using online learning.


Besides developing diagnostic techniques, we also investigate the robustness of complex networks with scale-free topology that are common in practice (e.g. Internet, Web, and some peer-to-peer systems such as Gnutella) to random failures and coordinated attacks. Figure 4a shows an example of such a network – a connectivity graph for a peer-to-peer file download system. We developed several network modification schemes that improve the robustness by edge rewiring and edge addition, and investigated the corresponding cost-efficiency trade-offs. We found that a modest alteration of an initially scale-free network can usefully improve robustness against attacks, particularly when the fraction of attacked nodes is small, and we identified modification schemes that are most effective for this purpose (for details, see our Physica A paper). Figure 4b shows the improvement achieved in the network robustness under attacks of various severity levels.

Figure 4

Figure 4. (a) An example of a scale-free network topology: connectivity in a peer-to-peer Gnutella network (sample crawl over 435 nodes). (b) Modifying a network to increase robustness against an attack that removes network ‘hubs’ (nodes with highest connectivity). Robustness is measured by the fraction of nodes that remain in the largest connected component after the attack (Y axis), while the severity of an attack is measured by the percentage of removed highest-connectivity nodes (X axis).

Since 2004 the project has been supported by the Adventurous Research program.

Related Publications  

Alina Beygelzimer, Geoffrey M. Grinstein, Ralph Linsker and Irina Rish. Improving Network Robustness. International Conference on Autonomic Computing (ICAC-04). May 2004.

Alina Beygelzimer, Geoffrey M. Grinstein, Ralph Linsker and Irina Rish. Improving Network Robustness by Edge Modification. Physica A 357(3-4):593-612, April 2005.

M. Brodie, I. Rish and S. Ma. Intelligent probing: A Cost-Efficient Approach to Fault Diagnosis in Computer Networks. IBM Systems Journal 41(3):372-385, 2002..

Mark A. Brodie, Sheng Ma, Irina Rish and Alina Beygelzimer. Test-based Diagnosis: Tree and Matrix Representations. IM 2005 - International Symposium on Integrated Network Management. November 2004.

Natalia Odintsova, Irina Rish and Sheng Ma. Multifault Diagnosis in Dynamic Systems. IM05 - International Symposium on Integrated System Management. November 2004.

I. Rish, M. Brodie and S. Ma. Accuracy versus efficiency in probabilistic diagnosis. Proceedings of National Conference on Artificial Intelligence. July 2002.

Irina Rish, Mark A. Brodie, Sheng Ma, Natalia Odintsova, Alina Beygelzimer, Genady Grabarnik and Karina Hernandez. Adaptive Diagnosis in Distributed Systems. IEEE Transactions on Neural Networks, March 2005.

Irina Rish. Distributed Systems Diagnosis Using Belief Propagation. Allerton 2005 - 43-d Annual Allerton Conference on Communication, Control and Computing, 2005. June 2005.

I. Rish, M. Brodie, N. Odintsova, S. Ma and G. Grabarnik. Real-time Problem Determination in Distributed Systems using Active Probing. Proceedings of NOMS. April 2004.

Alice Zheng, Irina Rish and Alina Beygelzimer. Efficient Test Selection in Active Diagnosis via Entropy Approximation. UAI-2005 (Uncertainty in AI). March 2005.


Rate this article

Innovator's corner  

Irina RishIrina Rish Researcher

What is the most exciting potential future use for the work you're doing?
Truly autonomic, self-healing computer systems able to perform real-time problem determination and performance managment with minimal human intervention.

What is the most interesting part of your research?
Theoretical analysis of optimality of our approaches - when it actually can be done, and you can get a sense of cost-efficiency trade-offs involved. Brainstorming session with your team members and other IBM Research colleagues on machine learning and information-theoretic problem formulations. And of course, watching your diagnostic system to perform 'smart' actions such as choosing next most-informative question to ask.

What inspired you to go into this field?
I was alsways interested in decision-making under uncertainty, problem solving and 'decoding' messages the nature 'sends' to you in order to understand 'hidden causes' explaining what you see - and especially doing so in an automated way. Although I should admit my interest in artificial ingtelligence was mainly triggered by the desire to understand - and improve if you can - the human intelligence. (Well, if you often have trouble with real-time decision making in your life, it is quite likely you may choose it as your research topic.)

What is your favorite invention of all time?
Computer. Well, I might be a bit biased in answering this question.

Research team  

Irina Rish
Alina Beygelzimer
Gerry Tesauro
Rajarshi Das

Former members:
Shang Guo
Natalia Odintsova

Collaborators:
Herb Lee
Ralph Linsker
Geoffrey Grinstein
Genady Grabarnik

Related Research