Technology and implementation of a new OLTP system
Abstract: Hardware has advanced considerably since the 1970s; high-performance servers are now typically equipped with TB-level memory capacity and dozens of physical cores. Traditional OLTP systems, however, are still built around disk storage and designed for hardware with a small number of physical cores, so they cannot effectively and fully exploit the computing power of new hardware. Meanwhile, with the growth of the Internet, applications place higher performance demands on transactional systems. In extreme cases, some applications must serve millions or even tens of millions of concurrent requests, which traditional database systems cannot support. The redesign and implementation of transactional database systems on high-performance hardware has therefore become an important research topic. This paper surveys recent work on transactional database systems in large-memory, multi-core environments and, using OceanBase, an open-source database developed by Alibaba, as an example, describes the design of a new OLTP (on-line transaction processing) system.
Key words:
- transaction processing
- concurrency control
- log and recovery
- multi-core scalability
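Tab. 1 Comparison of optimization performance of different concurrency control protocols
Columns: Optimization technique; Improves multi-core scalability; Reduces maintenance overhead
- Optimize the centralized lock table: $\surd$
- Reduce accesses to the centralized lock table: $\surd$
- Build thread-local lock tables: $\surd$
- Simplify locks and the lock table: $\surd$ $\surd$
- Optimize version number allocation: $\surd$

Tab. 2 Impact of log optimization strategy
Columns: Optimization strategy; Reduces the impact of disk I/O on performance; Avoids holding locks during commit; Reduces context switches caused by commit; Reduces conflicting accesses to the log buffer
- Group commit and asynchronous commit: $\surd$ $\surd$ $\surd$
- Releasing locks before the log is persisted: $\surd$
- Pipelined commit: $\surd$
- Scalable log buffer: $\surd$
- NVM-based distributed logging: $\surd$
- Logical logging: $\surd$ $\surd$ $\surd$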
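Group commit, the first strategy listed in Tab. 2, can be illustrated with a minimal single-threaded sketch. It is not taken from OceanBase or from any system cited here: the class `GroupCommitLog`, the `flush_fn` callback (a stand-in for a write followed by a disk sync), and the `group_size` threshold are all hypothetical names chosen for illustration. The sketch only shows the central idea, that several commit records share one flush, so the disk I/O cost is amortized across transactions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GroupCommitLog:
    """Toy write-ahead log that batches commit records before one flush."""

    flush_fn: Callable[[List[bytes]], None]  # hypothetical stand-in for write + sync
    group_size: int = 4                      # assumed batch threshold
    _pending: List[bytes] = field(default_factory=list)
    flush_count: int = 0                     # number of simulated disk flushes
    durable_count: int = 0                   # number of records made durable

    def commit(self, record: bytes) -> bool:
        """Queue a commit record; return True if this call triggered a flush."""
        self._pending.append(record)
        if len(self._pending) >= self.group_size:
            self.flush()
            return True
        return False

    def flush(self) -> None:
        """Persist all pending records with a single call to flush_fn."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self.flush_fn(batch)
        self.flush_count += 1
        self.durable_count += len(batch)


def run_demo(n_txns: int = 10, group_size: int = 4) -> GroupCommitLog:
    """Commit n_txns records against an in-memory 'disk' and return the log."""
    disk: List[List[bytes]] = []  # each element is one flushed batch
    log = GroupCommitLog(flush_fn=disk.append, group_size=group_size)
    for i in range(n_txns):
        log.commit(f"txn-{i}".encode())
    log.flush()  # a real system would also flush on a timer to bound latency
    return log
```

With ten transactions and a group size of four, `run_demo()` performs three flushes instead of ten. The trade-off is that a transaction is not durable until its group is flushed, which is why practical implementations typically pair the size threshold with a timeout.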