
IBM Israel Research Seminars

 

The problem of adding fault-tolerance properties to distributed applications has been the focus of much academic work for three decades.
With the introduction of middlewares as abstractions that encapsulate various operational aspects of a distributed system, a natural tendency began toward moving fault-tolerance support from the application level into the middleware. Our research question is thus how to extend a given middleware with fault-tolerance capabilities. We aim for a lightweight and efficient design that is both portable and capable of good performance.

Our research led to the design of a novel approach named the object-adaptor approach, which unifies many of the advantages of previous approaches while avoiding some of their pitfalls. We constructed a prototype called FTS (Fault-Tolerance Service), implemented in CORBA/Java, then extended it and analyzed its properties. Although FTS uses CORBA-specific facilities such as an object-adaptor, its architecture is a form of interception mechanism and is therefore completely portable to other platforms such as .NET and J2EE.

Our initial naive design indeed provided good portability and application transparency, but failed to achieve good performance. Studying the behavior of FTS in detail, we identified several problems that are general in nature and whose solutions can be applied to many types of middleware services. My talk, after a short overview of FTS, will focus on these problems and on our solutions, which allow FTS performance to improve significantly without sacrificing portability and transparency.

I will start with some initial performance data and point out performance limitations that result from FTS logic being kept at the application level, above the broker. Then I'll present a case where a small optimization in the seemingly negligible stages of serialization and de-serialization caused an unexpectedly large improvement in cluster throughput.

Next, I will talk about using SACKs (Selective ACKnowledgements) to reduce the size of the reply-cache of FTS servers. This cache stores the replies of completed requests so that, when a client re-invokes the same request, the server can answer from the cache instead of executing the request again, thereby ensuring at-most-once request execution semantics. The cache can become a scalability inhibitor due to its excessive memory consumption, for example in stateless application servers (where the state is kept in the third, database tier).
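The reply-cache idea can be illustrated with a minimal sketch. This is not the FTS implementation: the class name `ReplyCache`, the `handle` method, and the convention that each request piggybacks a list of already-received request IDs are all assumptions made for illustration. It also assumes a client never re-invokes a request it has already acknowledged.

```python
from typing import Any, Callable, Dict, Hashable, Iterable, Tuple


class ReplyCache:
    """Hypothetical server-side reply cache pruned by selective ACKs."""

    def __init__(self) -> None:
        # (client_id, request_id) -> cached reply
        self._replies: Dict[Tuple[Hashable, int], Any] = {}

    def handle(
        self,
        client_id: Hashable,
        request_id: int,
        acked: Iterable[int],
        execute: Callable[[], Any],
    ) -> Any:
        # The client selectively acknowledges replies it has received;
        # those entries will never be needed again and can be dropped.
        for rid in acked:
            self._replies.pop((client_id, rid), None)

        key = (client_id, request_id)
        if key in self._replies:
            # Re-invocation: answer from the cache, do not execute again.
            return self._replies[key]

        reply = execute()
        self._replies[key] = reply
        return reply

    def __len__(self) -> int:
        return len(self._replies)
```

In this sketch the cache holds only replies that the client has not yet confirmed, so its size is bounded by the number of outstanding requests rather than by the total number of requests served.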

Last, I'll introduce a technique for channel-independent message batching that features inherent adaptability to message arrival rate. This technique is used in FTS to boost the throughput of the inter-server ABCAST transport, and thus of the cluster in general. However, it is also applicable in entirely different contexts, e.g., to batch multiple HTTP updates into TCP connections.
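One common way to obtain batching that adapts to the arrival rate is sketched below. The abstract does not describe the actual FTS technique, so this is an assumed stand-in: messages that arrive while a send is in flight are queued and shipped together as soon as the channel is free. The names `AdaptiveBatcher`, `submit`, `on_send_complete`, and `max_batch` are hypothetical, and the channel is represented by an arbitrary `send` callable, which keeps the batcher independent of the transport.

```python
from collections import deque
from typing import Any, Callable, Deque, List


class AdaptiveBatcher:
    """Hypothetical batcher: batch size follows the message arrival rate."""

    def __init__(self, send: Callable[[List[Any]], None], max_batch: int = 64) -> None:
        self._send = send              # any channel: multicast, TCP, HTTP, ...
        self._max_batch = max_batch
        self._pending: Deque[Any] = deque()
        self._busy = False             # True while a batch is in flight

    def submit(self, msg: Any) -> None:
        self._pending.append(msg)
        if not self._busy:
            # Idle channel: send immediately, so low load adds no delay.
            self._flush()

    def on_send_complete(self) -> None:
        # Called by the channel when the previous batch has been sent.
        self._busy = False
        if self._pending:
            self._flush()

    def _flush(self) -> None:
        batch: List[Any] = []
        while self._pending and len(batch) < self._max_batch:
            batch.append(self._pending.popleft())
        self._busy = True
        self._send(batch)
```

Under light load every message goes out alone, while under heavy load messages accumulate during each send and leave in larger batches, with no timer or rate estimate to tune.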

About the Speaker
Erez Hadad is a graduating PhD student in the Computer Science Department at the Technion. His PhD research, titled "Architectures for Fault-Tolerant Middleware Services", was conducted under the supervision of Dr. Roy Friedman. The research deals with adding fault-tolerance capabilities to existing middlewares, using a novel design that aims for both middleware portability and performance.

Prior to returning to the Technion in 2000 for his Master's and later PhD (direct track) studies, Erez was employed for 6 years as a software engineer at RAFAEL, Israel's armament development authority, where he took part in several large-scale projects involving both hardware and distributed software systems and won a few work-excellence prizes. Before that, Erez graduated from the Technion with a B.Sc. in Computer Science (summa cum laude).