ZHANG Xin-Zhou, ZHOU Min-Qi. Fault tolerance recovery techniques in large distributed parallel computing system[J]. Journal of East China Normal University (Natural Sciences), 2014, (5): 207-215. doi: 10.3969/j.issn.1000-5641.2014.05.018
Today's supercomputing systems are often built as clusters of many commodity machines linked together. Like any distributed system, such a cluster can contain a large number of independent hardware components cooperating on a single computation. Unfortunately, any one of these components can fail at any time, potentially producing erroneous output. To make supercomputing applications more robust in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these fault tolerance techniques.