IBM Journal of Research and Development
Volume 50, Number 2/3, Page 199 (2006)
Exploratory Systems Research

Reliability of modular mesh-connected intelligent storage brick systems

by C. Fleiner, R. B. Garner, J. L. Hafner, KK Rao, D. R. Kenchammana-Hosekote, W. W. Wilcke, J. S. Glider
A key objective of the IBM Intelligent Bricks project is to create a highly reliable system from commodity components. We envision such systems to be architected for a service model called fail-in-place or deferred maintenance. By delaying service actions, possibly for the entire lifetime of the system, management of the system is simplified. This paper examines the hardware reliability and deferred maintenance of intelligent storage brick (ISB) systems assuming a mesh-connected collection of bricks in which each brick includes processing power, memory, networking, and storage. On the basis of Monte Carlo simulations, we quantify the fraction of bricks that become unusable by a distributed data redundancy scheme due to degrading internal bandwidth and loss of external host connectivity. We derive a system hardware reliability expression and predict the length of time ISB systems can operate without replacement of failed bricks. We also show via a Markov analysis the level of fault tolerance that is required by the data redundancy scheme to achieve a goal of less than two data loss events per exabyte-year due to multiple failures.
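As a rough illustration of the Monte Carlo idea described above, the following sketch fails bricks at random in a small 3D mesh and measures what fraction of the surviving bricks still belong to the largest connected group. It is not the authors' simulator: the cubic 6-neighbor topology, the independent failure probability `fail_prob`, and the use of connectivity as a stand-in for "usable" are all simplifying assumptions made here, and the paper's actual criteria (degrading internal bandwidth and loss of external host connectivity) are not modeled.

```python
import random
from collections import deque

# Hypothetical 6-neighbor offsets for a cubic mesh (an assumption, not from the paper).
NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def largest_component_fraction(n, fail_prob, rng):
    """Fail each brick of an n*n*n mesh independently with probability
    fail_prob; return the fraction of surviving bricks that lie in the
    largest connected component of survivors."""
    alive = {
        (x, y, z)
        for x in range(n)
        for y in range(n)
        for z in range(n)
        if rng.random() >= fail_prob
    }
    if not alive:
        return 0.0

    seen = set()
    best = 0
    for start in alive:
        if start in seen:
            continue
        # Breadth-first search over surviving neighbors.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            x, y, z = queue.popleft()
            size += 1
            for dx, dy, dz in NEIGHBORS:
                nb = (x + dx, y + dy, z + dz)
                if nb in alive and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        best = max(best, size)
    return best / len(alive)


def simulate(n=6, fail_prob=0.2, trials=200, seed=0):
    """Average the connected fraction over independent random trials."""
    rng = random.Random(seed)
    total = sum(largest_component_fraction(n, fail_prob, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5):
        print(f"fail_prob={p}: connected fraction ~ {simulate(fail_prob=p):.3f}")
```

Sweeping `fail_prob` upward mimics a fail-in-place system aging without repair. A fuller model would replace the independent failure probability with time-dependent failure rates and add the bandwidth and host-connectivity criteria the paper describes.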
Related Subjects: Computer architecture; Computer organization and design; Computer system availability