Coding Approaches to Fault Tolerance in Dynamic Systems
Fault tolerance has been a long standing necessity in system design and operation. In systems with memory (state), however, modular redundancy and other traditional approaches to fault tolerance are undesirable not only because they are expensive but also because they rely heavily on the assumption that the error-correcting (e.g. voting) mechanism is fault-free. This talk presents a general framework that systematically addresses these issues in fault-tolerant discrete-time dynamic systems. Our approach replaces the original system with a redundant implementation in which the state and next-state transition mechanism of the original system are encoded. Error detection, correction and/or reconfiguration are then based on detecting and identifying violations on the state encoding of this redundant implementation. One of the implications and advantages of this approach is that, unlike traditional methodologies that rely on concurrent checking at the end of each time step, this approach allows the construction of redundant systems in which detection and identification of errors is based on non-concurrent (e.g. periodic) checks. In the resulting design, the checker can operate at a slower speed than the rest of the system, which relaxes the stringent requirements on its reliability. We demonstrate this approach in the context of linear dynamic systems and arbitrary finite-state machines. We also discuss how our framework can handle failures in the error correction mechanism, enabling in this way the construction of reliable systems exclusively from unreliable components. In particular, we construct reliable LFSMs using unreliable XOR gates and voters in a way that requires only a bounded amount of redundancy per machine and achieves arbitrarily low probability of failure during any finite time interval. The talk will conclude with open problems and directions for future research.
Friday, March 21, 2003
3:30 – 4:30 p.m.