Storage and load balancing for large-scale comment data on heterogeneous Redis cluster
-
Abstract: The storage and query performance of large-scale comment data strongly affects the response time of the applications built on that data. In a heterogeneous computing environment, nodes differ in storage and computing performance, so a key challenge is to exploit the capacity of every node fully while optimizing the storage and query performance of large-scale comment data. Building on the data management strengths of Redis Cluster, we first propose a storage model for large-scale comment data in a homogeneous environment that balances storage across Redis slots. We then analyze the relationship between the number of slots on a node and its query efficiency, allocate slots under the principle of "balancing load against access performance", and design a load calculation and storage allocation method for cluster nodes in a heterogeneous environment, which makes full use of the differing performance of the nodes in a heterogeneous Redis cluster. Experimental results show that the proposed storage model balances storage well and improves the overall query efficiency of the cluster.
-
Key words:
- large-scale comment data
- storage and load balancing
- query optimization
-
Tab. 1 Two-level index for comment data

| Key name | Value: score | Value: content |
| --- | --- | --- |
| ItemID | StartTime | ItemID:Number |
Tab. 2 Storage structure for comment data

| Key name | Value: score | Value: content |
| --- | --- | --- |
| ItemID:Number | Timestamp | UserID:comment |
Tab. 3 Secondary index on UserID

| Key name | Value |
| --- | --- |
| UserID | ItemID:Number:Timestamp, ItemID:Number:Timestamp, $\cdots$ |
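The three structures in Tab. 1 to Tab. 3 can be pictured with a small in-memory model. The sketch below is an illustration only, not the authors' implementation: plain Python dictionaries and lists stand in for Redis sorted sets, and the names `CommentStore`, `add_comment` and `range_query`, as well as the rule that a new `ItemID:Number` bucket is opened once the current one holds `line_num` comments, are assumptions made for this example.

```python
import bisect
from collections import defaultdict


class CommentStore:
    """In-memory stand-in for the key layout of Tab. 1 to Tab. 3 (hypothetical)."""

    def __init__(self, line_num):
        self.line_num = line_num          # assumed cap on comments per bucket
        self.index = defaultdict(list)    # Tab. 1: ItemID -> [(StartTime, "ItemID:Number")]
        self.buckets = defaultdict(list)  # Tab. 2: "ItemID:Number" -> [(Timestamp, "UserID:comment")]
        self.by_user = defaultdict(list)  # Tab. 3: UserID -> ["ItemID:Number:Timestamp"]

    def add_comment(self, item_id, user_id, timestamp, text):
        # Assumes the comments of one item arrive in timestamp order.
        entries = self.index[item_id]
        if not entries or len(self.buckets[entries[-1][1]]) >= self.line_num:
            # Open a new bucket; its StartTime is the first timestamp it holds.
            entries.append((timestamp, f"{item_id}:{len(entries) + 1}"))
        bucket_key = entries[-1][1]
        self.buckets[bucket_key].append((timestamp, f"{user_id}:{text}"))
        self.by_user[user_id].append(f"{bucket_key}:{timestamp}")
        return bucket_key

    def range_query(self, item_id, start, end):
        """Return the comments of item_id with start <= Timestamp <= end."""
        entries = self.index.get(item_id, [])
        starts = [s for s, _ in entries]
        # Begin at the last bucket whose StartTime lies before `start`.
        first = max(bisect.bisect_left(starts, start) - 1, 0)
        result = []
        for bucket_start, bucket_key in entries[first:]:
            if bucket_start > end:
                break  # later buckets only hold newer comments
            result.extend(v for t, v in self.buckets[bucket_key] if start <= t <= end)
        return result


store = CommentStore(line_num=2)
store.add_comment("A", "u1", 10, "good")
store.add_comment("A", "u2", 20, "ok")
store.add_comment("A", "u1", 30, "bad")   # third comment opens bucket "A:2"
print(store.range_query("A", 15, 30))
```

The first-level index lets a range query skip every bucket that starts after the end of the requested interval, so only a few `ItemID:Number` keys are read for a short time range.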
Tab. 4 Experimental results for the storage partition parameter (LineNum)

| LineNum / 10^4 | Node 1 / MB | Node 2 / MB | Node 3 / MB | Standard deviation |
| --- | --- | --- | --- | --- |
| 1 | 502.40 | 501.26 | 460.20 | 19.63 |
| 2 | 495.12 | 489.26 | 471.28 | 10.14 |
| 3 | 509.81 | 486.51 | 421.12 | 37.54 |
| 4 | 486.44 | 526.01 | 433.86 | 37.74 |
| 5 | 557.57 | 414.99 | 450.43 | 60.61 |
| 6 | 644.51 | 362.38 | 410.35 | 123.26 |
| 7 | 664.95 | 376.70 | 384.30 | 134.13 |
| 8 | 661.56 | 400.65 | 373.29 | 129.92 |
| 9 | 648.79 | 469.65 | 339.60 | 126.76 |
| 10 | 602.16 | 471.74 | 341.24 | 106.52 |
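The last column of Tab. 4 is consistent with the population standard deviation of the three per-node storage sizes. The following minimal check recomputes it for the first two rows with Python's standard `statistics.pstdev`; the helper name `storage_stddev` is introduced here only for illustration.

```python
from statistics import pstdev


def storage_stddev(node_sizes):
    """Population standard deviation of per-node storage sizes, rounded to 2 decimals."""
    return round(pstdev(node_sizes), 2)


# LineNum (in units of 10^4) -> storage on nodes 1 to 3, taken from Tab. 4
rows = {
    1: [502.40, 501.26, 460.20],
    2: [495.12, 489.26, 471.28],
}
for line_num, sizes in rows.items():
    print(line_num, storage_stddev(sizes))
```

A smaller value means the data is spread more evenly over the nodes; in Tab. 4 the minimum, 10.14, occurs at LineNum = 2.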
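Tab. 5 Ratios of stored data after slot transfer

| Item | 1-core node | 2-core node | 4-core node | Ratio |
| --- | --- | --- | --- | --- |
| Key value | 235 | 299 | 620 | 1 : 1.272 : 2.638 |
| Storage after slot transfer / MB | 301.92 | 382.43 | 777.58 | 1 : 1.266 : 2.639 |
| Number of slots | 3 449 | 4 000 | 8 935 | 1 : 1.159 : 2.590 |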
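One simple way to turn per-node weights into slot counts is a proportional split of the 16 384 hash slots of a Redis cluster. The sketch below is not the allocation method of the paper: it is a plain largest-remainder split, the names `allocate_slots` and `ratios` are hypothetical, and using the "key value" row of Tab. 5 as the weight is an assumption. The slot counts listed in Tab. 5 (3 449, 4 000, 8 935) differ from what this sketch returns.

```python
TOTAL_SLOTS = 16384  # number of hash slots in a Redis cluster


def allocate_slots(weights, total=TOTAL_SLOTS):
    """Split `total` slots in proportion to `weights` (largest-remainder method)."""
    weight_sum = sum(weights)
    exact = [total * w / weight_sum for w in weights]
    counts = [int(x) for x in exact]
    # Hand the slots lost to truncation to the largest fractional parts.
    leftover = total - sum(counts)
    order = sorted(range(len(weights)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def ratios(values):
    """Express each value relative to the first one, as in the ratio column of Tab. 5."""
    return [v / values[0] for v in values]


# Assumed weights: the "key value" row of Tab. 5 (1-core, 2-core, 4-core node)
print(allocate_slots([235, 299, 620]))
print(ratios([3449, 4000, 8935]))
```

Whatever rule is used, the counts have to sum to 16 384 so that every slot is owned by exactly one node.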
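Tab. 6 Data used for the query tests

| Item ID | Number of comments | Size / MB |
| --- | --- | --- |
| 1 | 117 908 | 27.1 |
| 2 | 58 182 | 15.1 |
| 3 | 212 708 | 40.5 |
| 4 | 85 236 | 11.4 |
| 5 | 104 352 | 9.07 |
| 6 | 93 431 | 8.73 |
| 7 | 53 937 | 4.65 |
| 8 | 26 439 | 2.60 |
| 9 | 1 691 | 1.12 |
| 10 | 985 | 0.508 |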
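Tab. 7 Experimental results for range queries

| Query range | Before slot transfer / s | After slot transfer / s | Speed improvement / % |
| --- | --- | --- | --- |
| 1 day | 0.022 64 | 0.018 35 | 23.4 |
| 1 month | 0.278 66 | 0.223 21 | 24.8 |
| Half a year | 1.112 17 | 1.032 61 | 7.71 |
| 1 year | 1.944 21 | 1.833 71 | 6.00 |
| 1.5 years | 2.404 70 | 2.168 68 | 10.9 |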
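The speed improvement column of Tab. 7 appears to be the time saved relative to the query time after the slots were moved, that is (before − after) / after × 100. This reading is inferred from the numbers and is not stated in the table; it reproduces the listed values only to within rounding. A short check, with a hypothetical helper `speedup_percent`:

```python
def speedup_percent(before_s, after_s):
    """Relative improvement in percent, measured against the time after the slot move."""
    return (before_s - after_s) / after_s * 100


# (query range, seconds before, seconds after), taken from Tab. 7
rows = [
    ("1 day", 0.02264, 0.01835),
    ("1 month", 0.27866, 0.22321),
    ("half a year", 1.11217, 1.03261),
    ("1 year", 1.94421, 1.83371),
    ("1.5 years", 2.40470, 2.16868),
]
for label, before, after in rows:
    print(f"{label}: {speedup_percent(before, after):.2f} %")
```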