中国综合性科技类核心期刊(北大核心)

中国科学引文数据库来源期刊(CSCD)

美国《化学文摘》(CA)收录

美国《数学评论》(MR)收录

俄罗斯《文摘杂志》收录

留言板

尊敬的读者、作者、审稿人, 关于本刊的投稿、审稿、编辑和出版的任何问题, 您可以本页添加留言。我们将尽快给您答复。谢谢您的支持!

姓名
邮箱
手机号码
标题
留言内容
验证码

线性驱动的分布式数据库容错性自动化测试

沈靖 蔡鹏

沈靖, 蔡鹏. 线性驱动的分布式数据库容错性自动化测试[J]. 华东师范大学学报(自然科学版), 2020, (4): 98-107. doi: 10.3969/j.issn.1000-5641.201921011
引用本文: 沈靖, 蔡鹏. 线性驱动的分布式数据库容错性自动化测试[J]. 华东师范大学学报(自然科学版), 2020, (4): 98-107. doi: 10.3969/j.issn.1000-5641.201921011
SHEN Jing, CAI Peng. Lineage-driven distributed database system with an automated fault injection test tool[J]. Journal of East China Normal University (Natural Sciences), 2020, (4): 98-107. doi: 10.3969/j.issn.1000-5641.201921011
Citation: SHEN Jing, CAI Peng. Lineage-driven distributed database system with an automated fault injection test tool[J]. Journal of East China Normal University (Natural Sciences), 2020, (4): 98-107. doi: 10.3969/j.issn.1000-5641.201921011

线性驱动的分布式数据库容错性自动化测试

doi: 10.3969/j.issn.1000-5641.201921011
基金项目: 国家重点研发计划(2018YFB1003404); 国家自然科学基金(61972149)
详细信息
    通讯作者:

    蔡 鹏, 男, 副教授, 硕士生导师, 研究方向为高性能事务处理、高可用机制. E-mail: pcai@dase.ecnu.edu.cn

  • 中图分类号: TP392

Lineage-driven distributed database system with an automated fault injection test tool

  • 摘要: 大规模的分布式数据库中, 诸如网络分区、信息丢失、节点宕机等软硬件故障无法避免. 为了提高分布式数据库的可靠性、验证容错协议的正确性, 分布式数据库应定期进行故障注入测试, 即在系统运行过程中人为引发故障. 然而各种故障的组合空间太大, 无法枚举. 已有的测试方法: 一类是随机式故障组合, 其实现方法简单但不能保证探索了所有的故障组合; 另一类是通过专业知识分析系统构成并设计的故障组合, 其测试结果更加完善但不具备普及性. 以线性数据驱动的故障注入测试LDFI (Lineage-Driven Fault Injection)为原型, 在分布式数据库的基础上,实现了一种同时具有完备性和普及性的自动化故障注入测试工具. 实验结果表明, 该测试工具能够以更少的测试案例, 发现随机式故障注入无法发现的复合故障组合所引起的系统漏洞(bug), 提高了数据库的可信度.
  • 图  1  简单传递

    Fig.  1  Simple delivery

    图  2  LDFI流程图

    Fig.  2  LDFI flow chart

    图  3  Cedar的3群集架构

    Fig.  3  Cedar three-cluster architecture

    图  4  Cedar读取数据流程

    Fig.  4  Process of reading data

    图  5  测试用例数量

    Fig.  5  Number of test cases

    图  6  容错性协议的冗余操作

    Fig.  6  Redundant operation of fault tolerant protocols

    图  7  Ups校验码不连续问题

    Fig.  7  The Ups checksum is discontinuous

  • [1] BAILIS P, KINGSBURY K. The network is reliable [J]. Communications of the ACM, 2014, 57(9): 48-55. DOI:  10.1145/2643130.
    [2] DEAN J. Designs, lessons and advice from building large distributed systems [R/OL]. The 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware.(2009-10-10)[2019-08-07]. http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf.
    [3] ALSBERG P A, DAY J D. A principle for resilient sharing of distributed resources [C]//Proceedings of the 2nd International Conference on Software Engineering. IEEE, 1976: 562-570.
    [4] STONEBRAKER M. Concurrency control and consistency of multiple copies of data in distributed INGRES [J]. IEEE Transactions on Software Engineering, 1979(3): 188-194. DOI:  10.1109/TSE.1979.234180.
    [5] GARCIA-MOLINA H. Elections in a distributed computing system [J]. IEEE Transactions on Computers, 1982(1): 48-59.
    [6] LEESATAPORNWONGSA T, LUKMAN J F, LU S, et al. TaxDC: A taxonomy of non-deterministic concurrency bugs in datacenter distributed systems [J]. ACM SIGPLAN Notices, 2016, 51(4): 517-530. DOI:  10.1145/2954679.2872374.
    [7] GAO Y, DOU W S, QIN F, et al. An empirical study on crash recovery bugs in large-scale distributed systems [C]//Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 2018: 539-550.
    [8] FONSECA P, ZHANG K, WANG X, et al. An empirical study on the correctness of formally verified distributed systems [C]//Proceedings of the 12th European Conference on Computer Systems. ACM, 2017: 328-343.
    [9] GANESAN A, ALAGAPPAN R, ARPACI-DUSSEAU A C, et al. Redundancy does not imply fault tolerance: Analysis of distributed storage reactions to single errors and corruptions [J]. ACM Transactions on Storage, 2017, 13(3): Article 20. DOI:  10.1145/3125497.
    [10] DAWSON S, JAHANIAN F, MITTON T. ORCHESTRA: A fault injection environment for distributed systems [C]// Proceedings of the 26th International Symposium on Fault Tolerant Computing. IEEE, 1996: 404-414.
    [11] GUNAWI H S, DO T, JOSHI P, et al. FATE and DESTINI: A framework for cloud recovery testing [C]// Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation. ACM, 2011: 238-252.
    [12] KANAWATI G A, KANAWATI N A, ABRAHAM J A. FERRARI: A flexible software-based fault and error injection system [J]. IEEE Transactions on Computers, 1995, 44(2): 248-260. DOI:  10.1109/12.364536.
    [13] LEESATAPORNWONGSA T, HAO M Z, JOSHI P, et al. SAMC: Semantic-aware model checking for fast discovery of deep bugs in cloud systems [C]// Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association. 2014: 399-414.
    [14] BASIRI A, BEHNAM N, DE ROOIJ R, et al. Chaos engineering [J]. IEEE Software, 2016, 33(3): 35-41. DOI:  10.1109/MS.2016.60.
    [15] ALVARO P, ROSEN J, HELLERSTEIN J M. Lineage-driven fault injection [C]//Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015: 331-346.
    [16] ALVARO P, MARCZAK W R, CONWAY N, et al. Dedalus: Datalog in time and space [C]//International Datalog 2.0 Workshop Datalog 2.0 2010: Datalog Reloaded. Berlin: Springer, 2010: 262-281.
    [17] BUNEMAN P, KHANNA S, TAN W C. Why and where: A characterization of data provenance [C]//International Conference on Database Theory. Berlin: Springer, 2001: 316-330.
    [18] CUI Y, WIDOM J, WIENER J L. Tracing the lineage of view data in awarehousing environment [J]. ACM Transactions on Database Systems (TODS), 2000, 25(2): 179-227. DOI:  10.1145/357775.357777.
    [19] ALVARO P, CONWAY N, HELLERSTEIN J M, et al. Consistency analysis in bloom: A CALM and collected approach [C]// 5th Biennial Conference on Innovative Data Systems Research (CIDR ’11). 2011: 249-260.
    [20] LUKMAN J F, KE H, STUARDO C A, et al. FlyMC: Highly scalable testing of complex interleavings in distributed systems [C]//Proceedings of the 14th EuroSys Conference 2019. ACM, 2019: Article number 20.
    [21] MAJUMDAR R, NIKSIC F. Why is random testing effective for partition tolerance bugs? [J]. Proceedings of the ACM on Programming Languages, 2018, 2(POPL): Article number 46. DOI:  10.1145/3158134.
    [22] LAMPORT L. The part-time parliament [J]. ACM Transactions on Computer Systems (TOCS), 1998, 16(2): 133-169. DOI:  10.1145/279227.279229.
    [23] ONGARO D, OUSTERHOUT J. In search of an understandable consensus algorithm [C]// Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference. USENIX Association, 2014: 305-320.
  • 加载中
图(7)
计量
  • 文章访问数:  143
  • HTML全文浏览量:  87
  • PDF下载量:  13
  • 被引次数: 0
出版历程
  • 收稿日期:  2019-08-22
  • 网络出版日期:  2020-07-20
  • 刊出日期:  2020-07-25

目录

    /

    返回文章
    返回