BIAN Hao-Qiong, CHEN Yue-Guo, DU Xiao-Yong, GAO Yan-Jie. Equi-join optimization on spark[J]. Journal of East China Normal University (Natural Sciences), 2014, (5): 261-270. doi: 10.3969/j.issn.1000-5641.2014.05.023
Equi-join is one of the most common and costly operations in data analysis. Equi-join implementations on Spark differ from those in parallel databases: the equi-join algorithms based on data pre-partitioning that parallel databases commonly use can hardly be implemented on Spark. The equi-join algorithms currently in common use on Spark, such as broadcast join and repartition join, are not efficient, so improving equi-join performance is key to big data analysis on Spark. This work combines the advantages of SimiJoin and Partition Join to provide an optimized equi-join algorithm based on the features of Spark. Cost analysis and experiments indicate that this algorithm is 12 times faster than the algorithms used in state-of-the-art data analysis systems on Spark.
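
For readers unfamiliar with the two baseline strategies named in the abstract, the following is a minimal plain-Python sketch of the general idea behind a broadcast join and a repartition join. It is not the paper's optimized algorithm and it does not use the Spark API: partitions are simulated with ordinary lists, and the function names `broadcast_join` and `repartition_join`, the `num_partitions` parameter, and the sample tables are all assumptions made for illustration.

```python
from collections import defaultdict


def broadcast_join(large_partitions, small_table):
    """Map-side join: each partition of the large table probes a full
    copy of the small table, so the large table is never shuffled."""
    # Build a hash table on the small side (the "broadcast" copy).
    lookup = defaultdict(list)
    for key, value in small_table:
        lookup[key].append(value)

    joined = []
    for partition in large_partitions:
        for key, left_value in partition:
            for right_value in lookup.get(key, []):
                joined.append((key, left_value, right_value))
    return joined


def repartition_join(left_rows, right_rows, num_partitions):
    """Shuffle join: hash both inputs on the join key so that rows with
    equal keys land in the same partition, then join partition by partition."""
    left_parts = [[] for _ in range(num_partitions)]
    right_parts = [[] for _ in range(num_partitions)]
    for key, value in left_rows:
        left_parts[hash(key) % num_partitions].append((key, value))
    for key, value in right_rows:
        right_parts[hash(key) % num_partitions].append((key, value))

    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        # Within one partition, a local hash join is sufficient.
        joined.extend(broadcast_join([left_part], right_part))
    return joined


# Hypothetical sample data: (key, payload) pairs.
orders = [(1, "a"), (2, "b"), (1, "c"), (3, "d")]
users = [(1, "x"), (2, "y"), (4, "z")]

broadcast_result = sorted(broadcast_join([orders[:2], orders[2:]], users))
repartition_result = sorted(repartition_join(orders, users, num_partitions=3))
```

In this sketch the broadcast variant copies the whole small table to every partition, which only pays off when that table is small, while the repartition variant moves every row of both inputs across partitions. These are the kinds of costs the abstract refers to when it calls the two strategies inefficient.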