Fault tolerance recovery techniques in large distributed parallel computing systems
Abstract: Supercomputing systems today often take the form of large numbers of commodity machines linked together into a computing cluster. Like any distributed system, such a cluster relies on many independent hardware components cooperating on a computation. Unfortunately, any of these components can fail at any time, potentially producing erroneous output. To improve the robustness of supercomputing applications in the presence of failures, many fault tolerance techniques have been designed and implemented to handle various types of system faults. This survey summarizes these existing techniques as a basis for further research on fault tolerance suited to current systems.
Key words:
- fault tolerance
- parallel computing system
- cluster