中国综合性科技类核心期刊(北大核心)

中国科学引文数据库来源期刊(CSCD)

美国《化学文摘》(CA)收录

美国《数学评论》(MR)收录

俄罗斯《文摘杂志》收录

Message Board

Respected readers, authors and reviewers, you can add comments to this page on any questions about the contribution, review, editing and publication of this journal. We will give you an answer as soon as possible. Thank you for your support!

Name
E-mail
Phone
Title
Content
Verification Code
Issue 5
Dec.  2019
Turn off MathJax
Article Contents
DING Guo-hao, XU Chen, QIAN Wei-ning. Efficient data loading for log-structured data stores[J]. Journal of East China Normal University (Natural Sciences), 2019, (5): 143-158. doi: 10.3969/j.issn.1000-5641.2019.05.012
Citation: DING Guo-hao, XU Chen, QIAN Wei-ning. Efficient data loading for log-structured data stores[J]. Journal of East China Normal University (Natural Sciences), 2019, (5): 143-158. doi: 10.3969/j.issn.1000-5641.2019.05.012

Efficient data loading for log-structured data stores

doi: 10.3969/j.issn.1000-5641.2019.05.012
  • Received Date: 2019-07-28
  • Publish Date: 2019-09-25
  • With the rapid development of Internet technology in recent years, the number of users and the data processed by Internet companies and traditional financial institutions are growing rapidly. Traditionally, businesses have tackled this scalability problem by adding servers and adopting methods based on database sharding; however, this can lead to significant manual maintenance expenses and hardware overhead. To reduce overhead and the problems caused by database sharding, businesses commonly replace heritage equipment with new database systems. In this context, new databases based on log-structured merge tree storage (such as OceanBase) are being widely used; the data blocks stored on the disks of such systems exhibit global orderly features. In the process of switching from a traditional database to a new database, a large amount of data must be transferred, and database node downtime may occur when there is extended loading. In order to reduce the total time for loading and failure recovery, we propose a data loading method that supports load balancing and efficient fault tolerance. To support balanced data loading, we pre-calculate the number of partitions based on the file size and the default block size of a target system rather than using a pre-determined number of partitions. In addition, we use the feature that the data exported from the sharding database is usually sorted to determine the split points between partitions by selecting partial sampling blocks and selecting samples in sampling blocks at equal intervals, avoiding the high overhead caused by global sampling and selecting samples randomly or at the head in sampling blocks. To speed up the recovery process, we propose a replica-based partial recovery to avoid restart-based complete reloading; this method uses the multi-replica of an LSM-tree system to reduce the amount of reloaded data. Experimental results show that by pre-calculating the number of partitions and partial sampling blocks and by using equal-interval sampling, we can accelerate data loading relative to pre-determining the number of partitions and global sampling blocks as well as relative to random or head sampling strategies. Hence, we demonstrate the efficiency of replica-based partial failure recovery compared to restart-based complete reloading.
  • loading
  • [1]
    O'NEIL P, CHENG E, GAWLICK D, et al. The log-structured merge-tree (LSM-tree)[J]. Acta Informatica, 1996, 33(4):351-385. doi:  10.1007/s002360050048
    [2]
    CHANG F, DEAN J, GHEMAWAT S, et al. Bigtable: A distributed storage system for structured data[C]//OSDI'06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation-Volume 7. 2006: 205-218.
    [3]
    LevelDB[EB/OL].[2019-06-09]. https://github.com/google/leveldb.
    [4]
    Hbase[EB/OL].[2019-06-09]. http://hbase.apache.org/.
    [5]
    OceanBase[EB/OL].[2019-06-09]. https://github.com/alibaba/oceanbase/.
    [6]
    TiDB[EB/OL].[2019-06-09]. https://university.pingcap.com/
    [7]
    COOPER B F, RAMAKRISHNAN R, SRIVASTAVA U, et al. PNUTS:Yahoo!'s hosted data serving platform[J]. Proceedings of the VLDB Endowment, 2008, 1(2):1277-1288. doi:  10.14778/1454159.1454167
    [8]
    SILBERSTEIN A, COOPER B F, SRIVASTAVA U, et al. Efficient bulk insertion into a distributed ordered table[C]//Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 2008: 765-778.
    [9]
    SILBERSTEIN A E, SEARS R, ZHOU W, et al. A batch of PNUTS: Experiences connecting cloud batch and serving systems[C]//Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, 2011: 1101-1112.
    [10]
    Hadoop[EB/OL].[2019-06-09]. https://hadoop.apache.org/
    [11]
    Cassandra[EB/OL].[2019-06-09]. http://cassandra.apache.org/.
    [12]
    CDEAR[EB/OL].[2019-06-09]. https://github.com/daseECNU/Cedar/.
    [13]
    AZQUETA-ALZÚAZ A, PATIÑO-MARTINEZ M, BRONDINO I, et al. Massive data load on distributed database systems over HBase[C]//Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. IEEE, 2017: 776-779.
    [14]
    DEAN J, GHEMAWAT S. MapReduce:Simplified data processing on large clusters[J]. Communications of the ACM, 2008, 51(1):107-113. doi:  10.1145/1327452.1327492
    [15]
    BARCLAY T, BARNES R, GRAY J, et al. Loading databases using dataflow parallelism[J]. ACM Sigmod Record, 1994, 23(4):72-83. doi:  10.1145/190627.190647
    [16]
    RAMAKRISHNAN S R, SWART G, URMANOV A. Balancing reducer skew in MapReduce workloads using progressive sampling[C]//Proceedings of the 3rd ACM Symposium on Cloud Computing. ACM, 2012: Article No.16.
  • 加载中

Catalog

    通讯作者: 陈斌, bchen63@163.com
    • 1. 

      沈阳化工大学材料科学与工程学院 沈阳 110142

    1. 本站搜索
    2. 百度学术搜索
    3. 万方数据库搜索
    4. CNKI搜索

    Figures(9)  / Tables(4)

    Article views (101) PDF downloads(0) Cited by()
    Proportional views

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return