Title:
|
Self-correcting strategy for networks-on-chip interconnect
|
Networks-on-Chip (NoC) interconnection provides an on-chip
communication strategy for the large number of processing elements in a
System-on-Chip. Fault tolerance is a challenge for modern NoCs due to the increase
in physical defects in advanced manufacturing processes. A key requirement
for modern NoCs is the ability to detect faults and failures and to self-correct
after faults occur, thereby maintaining a level of system functionality.
However, existing fault-tolerant approaches cannot fully address system
scalability and fault testing with minimal intrusion. In addition, they fail to
provide robust self-correction strategies under complex traffic conditions.
Therefore, new fault detection and self-correction strategies are needed
to address this reliability issue and to enable the design of
reliable systems on unreliable fabrics.
This thesis presents a novel online fault detection strategy that minimises
intrusion into runtime operation during testing. If a channel is found to be
faulty, an alert flag is raised. Using this alert flag mechanism, three novel
fault-tolerant adaptive routing algorithms are proposed to provide self-correcting
strategies for NoCs. They exploit the status of real-time traffic through
look-ahead functions at different levels (local or regional), then calculate weights
for output directions or path candidates, and choose the path with the lowest
weighting to forward the packets. The key benefit of these routing algorithms
is that they bypass routing paths containing faulty channels while minimising
congestion on the adjacent connected channels. Detailed experimental results are
given for a range of testing conditions, traffic patterns and fault rates, which
demonstrate that the faults can be detected promptly with minimal intrusion
and the routing algorithms are able to maintain a level of system functionality
under high fault rates at low cost. In particular, experimental results
demonstrate that the proposed detection and self-correction strategy achieves
an overall improvement of between 24% and 62% in throughput degradation under
varied high fault rates compared to benchmarks.
The thesis also presents an open-source monitoring mechanism which
provides an evaluation and benchmarking mechanism to quantitatively
analyse a hardware NoC system's fault-tolerant capability. By using this
monitoring mechanism, the thesis concludes with hardware verification of the
detection and self-correction algorithms in FPGA hardware. The FPGA
implementations present the throughput performance, fault-tolerant
capabilities and resource costs of the three different fault-tolerant adaptive
routing algorithms. In particular, the implementations demonstrate the real-time
operation of the proposed self-correction strategies in hardware in the
presence of varied levels of faults.
|