Data calculation and performance optimization of dairy traceability based on Hadoop/Hive
Abstract: To improve how traditional dairy traceability systems handle large-scale enterprise production data, this paper analyzes the supply chain business processes of dairy enterprises, the key traceability units, and the traceability information. Combining Hadoop/Hive big data technology with distributed database technology, it designs and builds a dairy traceability framework based on Hadoop/Hive. A simulated big data environment was set up, and system performance was tested with actual production data. The experimental results show that, after Hadoop/Hive was introduced, the system's average data storage speed, average data access speed, and average data exchange speed improved by 87.43%, 27.10%, and 58.16%, respectively. The improved dairy traceability system stores and processes large-scale data markedly better than the traditional dairy traceability system.
Key words:
- Hadoop/Hive /
- dairy products traceability /
- data calculation /
- performance optimization
Tab. 1 Hardware and software configuration

| Item | Details and settings |
| --- | --- |
| OS | Ubuntu 12.04 LTS |
| Memory/Hard disk | 2 GB/100 GB |
| CPU | Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00 GHz × 2 |
| Database | MySQL Server 5.0 |
| Versions | Hadoop-2.5.2, Apache-hive-0.13.1, Sqoop-1.4.6, MySQL-Cluster-7.5.4, Tomcat7, Java 1.8 |
Tab. 2 Comparison of data import times

| Records (×10,000) | MySQL (s) | Hadoop/Hive (s) | Speed improvement (%) |
| --- | --- | --- | --- |
| 5 | 5.470 | 0.853 | 84.406 |
| 10 | 11.159 | 1.491 | 86.639 |
| 15 | 16.694 | 1.995 | 88.050 |
| 20 | 22.307 | 3.038 | 86.381 |
| 25 | 28.393 | 2.645 | 90.684 |
| 30 | 32.876 | 3.187 | 90.306 |
| 35 | 37.794 | 3.802 | 89.940 |
| 40 | 43.663 | 4.295 | 90.163 |
| 45 | 49.963 | 4.579 | 90.835 |
| 50 | 56.394 | 4.798 | 91.492 |
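
The speed improvement column in Tab. 2 is consistent with the relative time saved against the MySQL baseline, (t_MySQL - t_Hive) / t_MySQL × 100. The following is a minimal sketch that recomputes the column from the two timing columns; the function name `speed_improvement` and the list name `TAB2` are illustrative assumptions, not identifiers from the paper.

```python
def speed_improvement(baseline_s: float, improved_s: float) -> float:
    """Percentage of time saved relative to the baseline timing."""
    return (baseline_s - improved_s) / baseline_s * 100.0


# (records in units of 10,000, MySQL seconds, Hadoop/Hive seconds), copied from Tab. 2
TAB2 = [
    (5, 5.470, 0.853),
    (10, 11.159, 1.491),
    (15, 16.694, 1.995),
    (20, 22.307, 3.038),
    (25, 28.393, 2.645),
    (30, 32.876, 3.187),
    (35, 37.794, 3.802),
    (40, 43.663, 4.295),
    (45, 49.963, 4.579),
    (50, 56.394, 4.798),
]

for records, mysql_s, hive_s in TAB2:
    rate = speed_improvement(mysql_s, hive_s)
    print(f"{records:>3} x 10,000 records: {rate:.3f}% faster")
```

The same formula yields negative values when the new path is slower than the baseline, which is how the negative entries in Tab. 4 arise.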
Tab. 3 Average time of MySQL Cluster-to-Hive data migration

| Records (×10,000) | MySQL Cluster-Hive (s) | Txt-MySQL (s) | Speed improvement (%) |
| --- | --- | --- | --- |
| 200 | 48.608 | 118.02 | 58.814 |
| 400 | 124.053 | 248.223 | 50.024 |
| 600 | 210.446 | 451.022 | 53.340 |
| 800 | 170.608 | 497.065 | 65.677 |
| 1000 | 214.316 | 642.317 | 66.634 |
| 1200 | 289.474 | 831.472 | 65.185 |
| 1400 | 378.214 | 916.446 | 58.730 |
| 1600 | 438.589 | 1077.745 | 59.305 |
| 1800 | 515.656 | 1212.649 | 57.477 |
| 2000 | 686.056 | 1280.414 | 46.419 |
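
Averaging the ten per-row improvement rates in Tab. 3 gives roughly 58.16%, which agrees with the data exchange figure quoted in the abstract. The sketch below shows that arithmetic; that the authors derived their average as a simple mean of the row rates is an assumption, and the names `TAB3`, `rates`, and `avg_rate` are illustrative.

```python
from statistics import mean

# (records in units of 10,000, MySQL Cluster-Hive seconds, Txt-MySQL seconds), from Tab. 3
TAB3 = [
    (200, 48.608, 118.02),
    (400, 124.053, 248.223),
    (600, 210.446, 451.022),
    (800, 170.608, 497.065),
    (1000, 214.316, 642.317),
    (1200, 289.474, 831.472),
    (1400, 378.214, 916.446),
    (1600, 438.589, 1077.745),
    (1800, 515.656, 1212.649),
    (2000, 686.056, 1280.414),
]

# Per-row time saved relative to the Txt-MySQL baseline, in percent
rates = [(txt_s - cluster_s) / txt_s * 100.0 for _, cluster_s, txt_s in TAB3]

# Assumption: the headline figure is an unweighted mean over the rows
avg_rate = mean(rates)
print(f"mean improvement: {avg_rate:.2f}%")
```

An unweighted mean treats the 2-million-record run the same as the 20-million-record run; weighting by total elapsed time would give a different figure.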
Tab. 4 Average time of Hive-to-MySQL Cluster data migration

| Records (×10,000) | Hive-MySQL Cluster (s) | Txt-MySQL (s) | Speed improvement (%) |
| --- | --- | --- | --- |
| 5 | 27.579 | 5.47 | -404.186 |
| 10 | 31.815 | 11.159 | -185.106 |
| 15 | 30.132 | 16.694 | -80.496 |
| 20 | 34.147 | 22.307 | -53.078 |
| 25 | 33.764 | 28.393 | -18.917 |
| 30 | 36.897 | 32.876 | -12.231 |
| 35 | 37.798 | 37.794 | -0.011 |
| 40 | 40.013 | 43.663 | 8.359 |
| 45 | 40.125 | 49.963 | 19.691 |
| 50 | 41.014 | 56.394 | 27.272 |
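
Tab. 4 shows the Hive-to-MySQL Cluster path starting slower than the Txt-MySQL baseline and overtaking it between 350,000 and 400,000 records. The sketch below estimates that break-even point by linear interpolation between adjacent rows; the interpolation is an assumption made here for illustration and is not an analysis from the paper, and `TAB4` and `break_even` are hypothetical names.

```python
from typing import Optional

# (records in units of 10,000, Hive-MySQL Cluster seconds, Txt-MySQL seconds), from Tab. 4
TAB4 = [
    (5, 27.579, 5.470),
    (10, 31.815, 11.159),
    (15, 30.132, 16.694),
    (20, 34.147, 22.307),
    (25, 33.764, 28.393),
    (30, 36.897, 32.876),
    (35, 37.798, 37.794),
    (40, 40.013, 43.663),
    (45, 40.125, 49.963),
    (50, 41.014, 56.394),
]


def break_even(rows) -> Optional[float]:
    """Interpolated record count (x10,000) where both paths take equal time."""
    for (r0, a0, b0), (r1, a1, b1) in zip(rows, rows[1:]):
        d0, d1 = a0 - b0, a1 - b1  # positive means the Hive path is slower
        if d0 > 0 >= d1:
            # Linear interpolation of the zero crossing between the two rows
            return r0 + (r1 - r0) * d0 / (d0 - d1)
    return None  # no crossover within the measured range


point = break_even(TAB4)
print(f"estimated break-even: {point:.2f} x 10,000 records")
```

Under this assumption the crossover lies just above 350,000 records, so the measurements suggest the Hive path pays off only for batches larger than that.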