Title:
|
Some effective approaches on designing fault-tolerant digital circuits and systems
|
Continued technology scaling of integrated circuits (ICs) has brought various benefits to
modern life. Smaller ICs have made it possible to build everyday devices that are
smaller and cheaper. However, higher wear-out, stress effects and varied operating environments
have shortened and severely limited device lifetime. A possible solution
to alleviate this problem is to introduce fault tolerance into the system, making it resilient
to the faults that normally occur due to these effects. The main challenge here
is to provide an adequate increase in reliability without imposing much overhead.
To this end, this thesis presents several hardware and software approaches that improve
the reliability of a system and provide resilience against transient and permanent
faults.
We observed that the multiple-faults aware placement strategy improves the lifetime reliability
of digital circuits by lowering the error rate. We proposed several improvements
in the multiple-faults aware placement strategy to achieve faster processing and higher
reliability. These improvements are classified as hardware-level approaches to achieving
fault tolerance against multiple faults in digital circuits. An analytical method based on
Signal Probability Reliability Analysis (SPRA) is proposed to overcome the long
simulation time of profiling pairs of cells/gates. This method runs one order of
magnitude faster than the original simulation approach. We also proposed the use of a Hill
Climbing strategy after Simulated Annealing to reduce the wire length observed in the
original design. Experimental results show that this method can reduce the wire length
by up to 61%. We also proposed a novel optimisation algorithm to reduce the error rate
by manipulating the available free space to separate the 'bad pairs' in the circuit.
We investigated the levels of 'bad pairs' considered in the optimisation algorithm and
found that, with two categories of 'bad pairs', the error rate is reduced by up to 23% with little
simulation time overhead.
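The two-stage search mentioned above (Simulated Annealing followed by Hill Climbing) can be illustrated with a minimal sketch. It is not the placement algorithm of this thesis: the grid, the two-pin nets, the swap move and the cooling parameters are all assumptions chosen for illustration, and the cost here is plain Manhattan wire length with no fault-aware term.

```python
import math
import random


def wire_length(placement, nets):
    """Total Manhattan length of two-pin nets; placement maps cell -> (x, y)."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in nets)


def swap(placement, a, b):
    """Exchange the grid slots of cells a and b in place."""
    placement[a], placement[b] = placement[b], placement[a]


def simulated_annealing(placement, nets, rng,
                        t_start=5.0, t_end=0.05, alpha=0.95, moves=50):
    """Random swaps; worse moves are accepted with probability exp(-delta / t)."""
    placement = dict(placement)
    cells = list(placement)
    cost = wire_length(placement, nets)
    t = t_start
    while t > t_end:
        for _ in range(moves):
            a, b = rng.sample(cells, 2)
            swap(placement, a, b)
            new_cost = wire_length(placement, nets)
            delta = new_cost - cost
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost = new_cost          # accept the move
            else:
                swap(placement, a, b)    # reject: undo the swap
        t *= alpha                       # geometric cooling (assumed schedule)
    return placement


def hill_climb(placement, nets):
    """Greedy refinement: keep only swaps that strictly shorten the wires."""
    placement = dict(placement)
    cells = list(placement)
    cost = wire_length(placement, nets)
    improved = True
    while improved:
        improved = False
        for i, a in enumerate(cells):
            for b in cells[i + 1:]:
                swap(placement, a, b)
                new_cost = wire_length(placement, nets)
                if new_cost < cost:
                    cost = new_cost
                    improved = True
                else:
                    swap(placement, a, b)
    return placement


# Hypothetical 4x4 grid with 16 cells connected in a ring.
rng = random.Random(0)
cells = [f"c{i}" for i in range(16)]
slots = [(x, y) for x in range(4) for y in range(4)]
rng.shuffle(slots)
initial = dict(zip(cells, slots))
nets = [(cells[i], cells[(i + 1) % 16]) for i in range(16)]

annealed = simulated_annealing(initial, nets, rng)
refined = hill_climb(annealed, nets)
print(wire_length(initial, nets), wire_length(annealed, nets),
      wire_length(refined, nets))
```

Because the second stage accepts only strictly improving swaps, its result is never longer than the annealed placement it starts from; how much it gains in practice depends on the circuit and the annealing schedule.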
Checkpointing has been used for decades as one of the primary software-level approaches
for mitigating the effect of transient faults in a system. We studied the effectiveness
of checkpointing in terms of the lifetime reliability of a system, rather than merely as a way of providing
fault tolerance. Here, we proposed a novel checkpointing mechanism, namely the Lifetime
Reliability-Aware Checkpointing Mechanism (LRAC), that is capable of not only tolerating
transient faults but also migrating the task to a spare host whenever a permanent
fault occurs or is expected to occur. We observed that this incurs approximately 12%
time overhead, only during the occurrences of faults, even when the fault rate is as high
as 10^-3. Nevertheless, this approach still meets the hard deadlines of the tasks
being executed.
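The general idea of rolling back on a transient fault and migrating on a permanent fault can be sketched as follows. This is a toy model only and not the LRAC mechanism itself: the function name, the fault schedule keyed by attempt number, the fixed checkpoint interval and the host names are all hypothetical, and fault prediction and deadline handling are left out.

```python
import copy


def run_with_checkpoints(steps, faults, interval=3, hosts=("primary", "spare")):
    """Run `steps` units of work, saving a checkpoint every `interval` steps.

    `faults` maps an attempt number (every loop iteration counts as one
    attempt) to "transient" or "permanent". This schedule is an assumption
    used to inject faults deterministically.
    """
    state = {"step": 0, "total": 0}
    checkpoint = copy.deepcopy(state)
    host_idx = 0
    attempts = 0
    log = []
    while state["step"] < steps:
        attempts += 1
        kind = faults.get(attempts)
        if kind == "transient":
            # Transient fault: discard work since the last checkpoint, same host.
            state = copy.deepcopy(checkpoint)
            log.append(("rollback", hosts[host_idx], state["step"]))
            continue
        if kind == "permanent":
            # Permanent fault: resume from the last checkpoint on the next host.
            host_idx += 1
            if host_idx >= len(hosts):
                raise RuntimeError("no spare host left")
            state = copy.deepcopy(checkpoint)
            log.append(("migrate", hosts[host_idx], state["step"]))
            continue
        # Fault-free attempt: do one unit of work.
        state["step"] += 1
        state["total"] += state["step"]
        if state["step"] % interval == 0:
            checkpoint = copy.deepcopy(state)
    return state, attempts, hosts[host_idx], log


# One transient and one permanent fault injected into a 10-step task.
state, attempts, host, log = run_with_checkpoints(
    10, {5: "transient", 9: "permanent"})
print(state, attempts, host, log)
```

In this sketch the extra attempts appear only when a fault is injected; a fault-free run takes exactly `steps` attempts and never leaves the first host.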
|