线性驱动的分布式数据库容错性自动化测试

沈靖; 蔡鹏

doi:10.3969/j.issn.1000-5641.201921011

线性驱动的分布式数据库容错性自动化测试

doi: 10.3969/j.issn.1000-5641.201921011

沈靖,
蔡鹏^,

华东师范大学数据科学与工程学院, 上海　200062

基金项目: 国家重点研发计划(2018YFB1003404); 国家自然科学基金(61972149)

详细信息

通讯作者:
蔡　鹏, 男, 副教授, 硕士生导师, 研究方向为高性能事务处理、高可用机制. E-mail: pcai@dase.ecnu.edu.cn

中图分类号: TP392
计量
- 文章访问数: 143
- HTML全文浏览量: 87
- PDF下载量: 13
- 被引次数: 0
出版历程
- 收稿日期: 2019-08-22
- 网络出版日期: 2020-07-20
- 刊出日期: 2020-07-25

Lineage-driven distributed database system with an automated fault injection test tool

SHEN Jing,
CAI Peng^,

School of Data Science and Engineering, East China Normal University, Shanghai　200062, China

摘要

摘要: 大规模的分布式数据库中, 诸如网络分区、信息丢失、节点宕机等软硬件故障无法避免. 为了提高分布式数据库的可靠性、验证容错协议的正确性, 分布式数据库应定期进行故障注入测试, 即在系统运行过程中人为引发故障. 然而各种故障的组合空间太大, 无法枚举. 已有的测试方法: 一类是随机式故障组合, 其实现方法简单但不能保证探索了所有的故障组合; 另一类是通过专业知识分析系统构成并设计的故障组合, 其测试结果更加完善但不具备普及性. 以线性数据驱动的故障注入测试LDFI (Lineage-Driven Fault Injection)为原型, 在分布式数据库的基础上，实现了一种同时具有完备性和普及性的自动化故障注入测试工具. 实验结果表明, 该测试工具能够以更少的测试案例, 发现随机式故障注入无法发现的复合故障组合所引起的系统漏洞(bug), 提高了数据库的可信度.
- 分布式 /
- 线性数据 /
- 自动化 /
- 故障注入
Abstract: Failures are unavoidable in distributed database systems. To improve the fault tolerance of distributed database systems and verify the accuracy of fault-tolerant protocols, the system should periodically run a fault injection test to artificially trigger a fault during system operations. However, the scale and complexity of distributed database systems make it difficult to fully enumerate inputs and make it impractical to explain all the behaviors that occur in the system. One of the test methods commonly used is a random combination of faults, which is simple but not complete; the other one is guided by professional knowledge and is not universally applicable. Accordingly, we adopted and revised a research prototype, called the lineage-driven fault injection (LDFI) test, that is both complete and universally applicable. We implemented the automation fault injection tool in Cedar. Experiments showed that lineage-driven fault injection tests can successfully detect system bugs caused by complex fault combinations and improve the credibility of the database; these bugs cannot be detected by random fault injection with fewer test cases.
- distributed /
- data lineage /
- automation /
- fault injection

HTML全文

图 1 简单传递

Fig. 1 Simple delivery

下载: 全尺寸图片幻灯片

图 2 LDFI流程图

Fig. 2 LDFI flow chart

下载: 全尺寸图片幻灯片

图 3 Cedar的3群集架构

Fig. 3 Cedar three-cluster architecture

下载: 全尺寸图片幻灯片

图 4 Cedar读取数据流程

Fig. 4 Process of reading data

下载: 全尺寸图片幻灯片

图 5 测试用例数量

Fig. 5 Number of test cases

下载: 全尺寸图片幻灯片

图 6 容错性协议的冗余操作

Fig. 6 Redundant operation of fault tolerant protocols

下载: 全尺寸图片幻灯片

图 7 Ups校验码不连续问题

Fig. 7 The Ups checksum is discontinuous

下载: 全尺寸图片幻灯片

参考文献(23)

[1]	BAILIS P, KINGSBURY K. The network is reliable [J]. Communications of the ACM, 2014, 57(9): 48-55. DOI: 10.1145/2643130.
[2]	DEAN J. Designs, lessons and advice from building large distributed systems [R/OL]. The 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware.(2009-10-10)[2019-08-07]. http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf.
[3]	ALSBERG P A, DAY J D. A principle for resilient sharing of distributed resources [C]//Proceedings of the 2nd International Conference on Software Engineering. IEEE, 1976: 562-570.
[4]	STONEBRAKER M. Concurrency control and consistency of multiple copies of data in distributed INGRES [J]. IEEE Transactions on Software Engineering, 1979(3): 188-194. DOI: 10.1109/TSE.1979.234180.
[5]	GARCIA-MOLINA H. Elections in a distributed computing system [J]. IEEE Transactions on Computers, 1982(1): 48-59.
[6]	LEESATAPORNWONGSA T, LUKMAN J F, LU S, et al. TaxDC: A taxonomy of non-deterministic concurrency bugs in datacenter distributed systems [J]. ACM SIGPLAN Notices, 2016, 51(4): 517-530. DOI: 10.1145/2954679.2872374.
[7]	GAO Y, DOU W S, QIN F, et al. An empirical study on crash recovery bugs in large-scale distributed systems [C]//Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 2018: 539-550.
[8]	FONSECA P, ZHANG K, WANG X, et al. An empirical study on the correctness of formally verified distributed systems [C]//Proceedings of the 12th European Conference on Computer Systems. ACM, 2017: 328-343.
[9]	GANESAN A, ALAGAPPAN R, ARPACI-DUSSEAU A C, et al. Redundancy does not imply fault tolerance: Analysis of distributed storage reactions to single errors and corruptions [J]. ACM Transactions on Storage, 2017, 13(3): Article 20. DOI: 10.1145/3125497.
[10]	DAWSON S, JAHANIAN F, MITTON T. ORCHESTRA: A fault injection environment for distributed systems [C]// Proceedings of the 26th International Symposium on Fault Tolerant Computing. IEEE, 1996: 404-414.
[11]	GUNAWI H S, DO T, JOSHI P, et al. FATE and DESTINI: A framework for cloud recovery testing [C]// Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation. ACM, 2011: 238-252.
[12]	KANAWATI G A, KANAWATI N A, ABRAHAM J A. FERRARI: A flexible software-based fault and error injection system [J]. IEEE Transactions on Computers, 1995, 44(2): 248-260. DOI: 10.1109/12.364536.
[13]	LEESATAPORNWONGSA T, HAO M Z, JOSHI P, et al. SAMC: Semantic-aware model checking for fast discovery of deep bugs in cloud systems [C]// Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association. 2014: 399-414.
[14]	BASIRI A, BEHNAM N, DE ROOIJ R, et al. Chaos engineering [J]. IEEE Software, 2016, 33(3): 35-41. DOI: 10.1109/MS.2016.60.
[15]	ALVARO P, ROSEN J, HELLERSTEIN J M. Lineage-driven fault injection [C]//Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015: 331-346.
[16]	ALVARO P, MARCZAK W R, CONWAY N, et al. Dedalus: Datalog in time and space [C]//International Datalog 2.0 Workshop Datalog 2.0 2010: Datalog Reloaded. Berlin: Springer, 2010: 262-281.
[17]	BUNEMAN P, KHANNA S, TAN W C. Why and where: A characterization of data provenance [C]//International Conference on Database Theory. Berlin: Springer, 2001: 316-330.
[18]	CUI Y, WIDOM J, WIENER J L. Tracing the lineage of view data in awarehousing environment [J]. ACM Transactions on Database Systems (TODS), 2000, 25(2): 179-227. DOI: 10.1145/357775.357777.
[19]	ALVARO P, CONWAY N, HELLERSTEIN J M, et al. Consistency analysis in bloom: A CALM and collected approach [C]// 5th Biennial Conference on Innovative Data Systems Research (CIDR ’11). 2011: 249-260.
[20]	LUKMAN J F, KE H, STUARDO C A, et al. FlyMC: Highly scalable testing of complex interleavings in distributed systems [C]//Proceedings of the 14th EuroSys Conference 2019. ACM, 2019: Article number 20.
[21]	MAJUMDAR R, NIKSIC F. Why is random testing effective for partition tolerance bugs? [J]. Proceedings of the ACM on Programming Languages, 2018, 2(POPL): Article number 46. DOI: 10.1145/3158134.
[22]	LAMPORT L. The part-time parliament [J]. ACM Transactions on Computer Systems (TOCS), 1998, 16(2): 133-169. DOI: 10.1145/279227.279229.
[23]	ONGARO D, OUSTERHOUT J. In search of an understandable consensus algorithm [C]// Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference. USENIX Association, 2014: 305-320.