Application of the consistency protocol in distributed database systems
-
摘要: 近年来分布式数据库产品层出不穷,但分布式数据库较于单机数据库更复杂,为了让系统可用,设计者需要采用一致性协议来保证分布式数据库系统中的可用性和一致性这两个重要特性.保证一致性需要使用一致性协议为并发的事务更新操作确定一个全局的执行顺序,并协调局部状态和全局状态不断的达成动态一致;保证可用性需要一致性协议协调多副本之间的一致来实现主备节点的无缝切换.因此分布式一致性协议是实现分布式数据库系统的重要基础.详细介绍了经典的分布式一致性协议以及在目前常见的几种分布式数据库系统中一致性协议的应用,并从读写操作、节点类型与网络通信等方面进行对比分析.Abstract: In recent years, many distributed database products have emerged in the market; yet, distributed databases are still more complex than centralized databases. In order to make the system useable, designers need to adopt the consistency protocol to ensure two important features of distributed database systems:availability and consistency. The protocol ensures consistency by determining the global execution order of operations for concurrent transactions and by coordinating local and global states to achieve continuous dynamic agreement; The consistency protocol ensures availability by coordinating consistency between multiple copies to achieve seamless switching between master and standby nodes. Hence, the distributed consensus protocol is the fundamental basis for the distributed database system. This paper reviews, in detail, the classic distributed consistency protocol and the application of the consistency protocol to current mature distributed database. The study also provides analysis and a comparison between the two approaches considering factors like read-write operation, node type, and network communication.
-
Key words:
- distributed database /
- distributed consistency protocol /
- consistency /
- availability
-
表 1 Basic Paxos与基于leader的Multi Paxos的区别
Tab. 1 Differences between Basic Paxos and Multi Paxos based on leader
Basic Paxos Multi Paxos 达成一致性的流程可分为Prepare Request和Accept Request两阶段. 选主运行一次Basic Paxos. leader任期内达成一致性只需执行AcceptRequest阶段即可. 无leader, 所有proposer都可以发起议案, 容易造成复杂情况, 性能低. 只有leader才能发起议案, 逻辑简单, 性能高, 存在单点故障问题. proposer只有在收到多数派PrepareRequest回复成功的消息时, 才能确定议案中的值. Leader可以直接可以确定议案中的值. 达成一致性至少需要两轮网络I/O, 延迟高 需要网络I/O较少, 延迟低 表 2 节点日志不一致现象及原因分析
Tab. 2 Occurances and reasons for inconsistencies of logs between nodes
日志不一致现象 部分原因 follower没有leader的某些日志. 旧leader还未将全部日志复制到该follower就宕机, 随后迅速重启又当选为新leader; 或者follower宕机后重启, leader正常运行. leader没有follower的某些日志. leader在前几个任期一直处于宕机状态, 在当前任期重启后当选为leader, 而follower一直正常运行. follower没有leader的某些日志, leader又没有follower的某些日志. leader在还没有将自己的全部日志复制到该follower就宕机, 随后一直处于宕机状态, 经过几个任期后重启并当选为新的leader. 表 3 分布式一致性算法对比分析
Tab. 3 Comparsion and analysis of distributed consistency algorithms
Quorum Basic Paxos Raft 主要思想 多数派思想 多数派思想 多数派思想 选主阶段 可以有多个leader 可以有多个leader, leader不需要拥有全部已提交的日志. 强调leader的地位(唯一性), leader拥有全部已提交的日志. 日志复制阶段 允许日志空洞 没有日志空洞, 保持日志的连续性. 优点 可根据实际需求配置不同的NWR参数实现不同级别的一致性. 可实现强一致性, 能容忍少于半数节点故障, 可用性较高. 可实现强一致性, 逻辑简单, 可用性较高, 工程实现难度较低. 缺点 不能保证多副本之间更新操作执行顺序的一致性. 多个proposer发起议案可能导致复制情况较多, 工程实现难度较大. 受网络分区影响可能出现"频繁选举"和"双主"问题. 适用场景 对一致性要求不高的场景. 对一致性要求较高, 多节点间网络通信良好的场景. 无网络分区的场景. 表 4 采用一致性协议的分布式数据库对比分析
Tab. 4 Comparsion and analysis of consistency algorithms in distributed database
Chubby XPaxos Meagstore Spanner PaxosStore Cassandra 设计目标 提供分布式锁服务. 高性能分布式强一致的Paxos库. 满足交互式在线服务需求存储系统. 多版本、全球分布式同步复制数据库. 支持混合存储的通用存储系统. KEY-VALUE类型的NOSQL数据库 读写操作 由唯一主节点处理读写操作, 如Chubby.
由唯一主节点处理写操作, 其他节点可以处理读操作.如Spanner、XPaxos.
放弃唯一主节点处理读写操作的方法, 任意节点均可能处理读写操作.
的NOSQL数据库节点类型 用状态机维护节点状态, 节点为全功能节点, 如Chubby.
用状态机维护节点状态, 节点可以是定制功能节点, 如XPaxos、Meagstore、Spanner.
放弃使用状态机描述节点, 各节点保存数据在所有节点上副本的相关信息,
如PaxosStore、Cassandra节点通信 节点间通信包括PrepareRequest、AcceptRequest等多种类型的消息以及多种
消息处理模块.状态描述较复杂.如Chubby、XPaxos、Meagstore、Spanner.
采用统一的通信消息类型, 即M$_{X\to Y}$=(SX X, SX Y), 统一的消息处理模块.简化了
信息传递与信息处理的过程, 避免复杂的状态描述.如PaxosStore.
采用点对点通讯协议gossip实现节点间通信, 如Cassandra. -
[1] TANENBAUM, MAARTEN VAN STEEN. Distributed Systems Principles and Paradigms[M]. 2nd ed. USA:Pearson, 2001:1-10. [2] BREWER E A. Towards robust distributed systems (abstract)[C]//Nineteenth ACM Symposium on Principles of Distributed Computing. New York: ACM, 2000: 7. https://dl.acm.org/citation.cfm?id=343502 [3] GILBERT S, LYNCH N. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services[J]. Acm Sigact News, 2002, 33(2):51-59. doi: 10.1145/564585 [4] TANENBAUM A S, STEEN M V. Distributed Systems:Principles and Paradigms[M]. Beijing:Tsinghua University Press, 2002. [5] 朱涛, 郭进伟, 周欢, 等.分布式数据库中一致性与可用性的关系[J].软件学报, 2018(1):131-149. http://d.old.wanfangdata.com.cn/Periodical/rjxb201801007 [6] 储佳佳, 郭进伟, 刘柏众, 等.高可用数据库系统中的分布式一致性协议[J].华东师范大学学报(自然科学版), 2016, 2016(5):1-9. doi: 10.3969/j.issn.1000-5641.2016.05.001 [7] GIFFORD D K. Weighted voting for replicated data[C]//Acm Symposium on Operating Systems Principles. New York: ACM, 1979: 150-162. http://lass.cs.umass.edu/~shenoy/courses/spring04/677/readings/gifford.pdf [8] ONGARO D, OUSTERHOUT J K. In search of an understandable consensus algorithm[C]//USENIX Annual Technical Conference. New York: ACM, 2014: 305-319. https://web.stanford.edu/~ouster/cgi-bin/papers/raft-atc14 [9] JUNQUEIRA F P, REED B C, SERAFINI M. Zab: High-performance broadcast for primary-backup systems[C]//International Conference on Dependable Systems & Networks. New York: IEEE, 2011: 245-256. http://www.cs.cornell.edu/courses/cs6452/2012sp/papers/zab-ieee.pdf [10] LAMPORT L. Paxos made simple[J]. ACM Sigact News, 2001, 32(4):18-25. http://d.old.wanfangdata.com.cn/NSTLQK/NSTL_QKJJ0235401123/ [11] CHANDRA T D, GRIESEMER R, REDSTONE J. Paxos made live: an engineering perspective[C]//Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing. New York: ACM, 2007: 398-407. http://www.read.seas.harvard.edu/~kohler/class/08w-dsi/chandra07paxos.pdf [12] LAMPORT L B, MASSA M T. Cheap paxos: U.S. Patent 7, 249, 280[P]. 2007-07-24. [13] LAMPORT L. Fast paxos[J]. Distributed Computing, 2006, 19(2):79-103. doi: 10.1007/s00446-006-0005-x [14] LAMPORT L, MALKHI D, ZHOU L. Vertical paxos and primary-backup replication[C]//Proceedings of the 28th ACM symposium on Principles of distributed computing. New York: ACM, 2009: 312-313. https://www.microsoft.com/en-us/research/publication/vertical-paxos-and-primary-backup-replication/ [15] LAMPORT L, SHOSTAK R, PEASE M. The Byzantine generals problem[J]. ACM Transactions on Programming Languages and Systems (TOPLAS), 1982, 4(3):382-401. doi: 10.1145/357172.357176 [16] 张晨东, 郭进伟, 刘柏众, 等.基于Raft一致性协议的高可用性实现[J].华东师范大学学报(自然科学版), 2015(5):172-184. doi: 10.3969/j.issn.1000-5641.2015.05.015 [17] 庞天泽.可扩展数据管理系统中的高可用实现[D].上海: 华东师范大学, 2016. http://cdmd.cnki.com.cn/Article/CDMD-10269-1016137978.htm [18] FACEBOOK. The Apache Software foundation: Apache Cassandra Documentation v4.0[EB/OL]. (2016-09-01)[2018-04-10] http://cassandra.apache.org/. [19] ZHENG J, LIN Q, XU J, et al. PaxosStore:High-availability storage made practical in WeChat[J]. Proceedings of the VLDB Endowment, 2017, 10(12):1730-1741. doi: 10.14778/3137765 [20] 江疑. X-Paxos: 阿里巴巴的高性能分布式强一致Paxos独立基础库[EB/OL].[2017-08-07]. http://developer.51cto.com/art/201708/547380.htm. [21] BURROWS M. The Chubby lock service for loosely-coupled distributed systems[C]//USENIX Association. Proceedings of the 7th symposium on Operating systems design and implementation. New York: ACM, 2006: 335-350. https://ai.google/research/pubs/pub27897 [22] BAKER J, BOND C, CORBETT J C, et al. Megastore: Providing scalable, highly available storage for interactive services[C]//Biennial Conference on Innovative Data Systems Research. USA: Online Proceedings, 2011(11): 223-234. http://cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf [23] CORBETT J C, DEAN J, EPSTEIN M, et al. Spanner:Google's globally distributed database[J]. ACM Transactions on Computer Systems (TOCS), 2013, 31(3):8. http://d.old.wanfangdata.com.cn/Periodical/txxb200506011