中国综合性科技类核心期刊(北大核心)

中国科学引文数据库来源期刊(CSCD)

美国《化学文摘》(CA)收录

美国《数学评论》(MR)收录

俄罗斯《文摘杂志》收录

留言板

尊敬的读者、作者、审稿人, 关于本刊的投稿、审稿、编辑和出版的任何问题, 您可以本页添加留言。我们将尽快给您答复。谢谢您的支持!

姓名
邮箱
手机号码
标题
留言内容
验证码

基于分布式平台Spark的空间文本查询分析

徐阳 王志杰 钱诗友

徐阳, 王志杰, 钱诗友. 基于分布式平台Spark的空间文本查询分析[J]. 华东师范大学学报(自然科学版), 2018, (5): 120-134, 153. doi: 10.3969/j.issn.1000-5641.2018.05.010
引用本文: 徐阳, 王志杰, 钱诗友. 基于分布式平台Spark的空间文本查询分析[J]. 华东师范大学学报(自然科学版), 2018, (5): 120-134, 153. doi: 10.3969/j.issn.1000-5641.2018.05.010
XU Yang, WANG Zhi-jie, QIAN Shi-you. Distributed spatio-textual analytics based on the Spark platform[J]. Journal of East China Normal University (Natural Sciences), 2018, (5): 120-134, 153. doi: 10.3969/j.issn.1000-5641.2018.05.010
Citation: XU Yang, WANG Zhi-jie, QIAN Shi-you. Distributed spatio-textual analytics based on the Spark platform[J]. Journal of East China Normal University (Natural Sciences), 2018, (5): 120-134, 153. doi: 10.3969/j.issn.1000-5641.2018.05.010

基于分布式平台Spark的空间文本查询分析

doi: 10.3969/j.issn.1000-5641.2018.05.010
基金项目: 

国家重点研发计划项目 2017YFC0803700

广东省科技计划项目 2015A030401057

广东省科技计划项目 2016B030307002

详细信息
    作者简介:

    徐阳, 男, 硕士研究生, 研究方向为分布式计算、大数据处理.E-mail:xuyangit@sjtu.edu.cn

    通讯作者:

    王志杰, 男, 博士, 副研究员, 研究方向为数据挖掘等.E-mail:wangzhij5@mail.sysu.edu.cn

  • 中图分类号: TP311

Distributed spatio-textual analytics based on the Spark platform

  • 摘要: 随着基于位置服务应用的不断推广,空间文本数据查询的应用价值(例如结合地理位置和用户标签的社交推荐)也在不断提高.但是,随着数据规模的迅速增长,传统的基于单机环境实现的技术难以为用户提供低延时和高吞吐量的服务.为此,本文基于Spark平台对分布式环境下的空间文本查询算法进行了探究.采用了面向海量空间文本数据的两层索引框架(包括全局索引和局部索引),该框架利用了分阶段过滤的策略来处理分布式下的布尔范围查询问题.同时,针对空间文本相似连接提出了Prefix-RI结构并提出了相应的分布式算法.基于Spark平台实现了所提出的分布式算法,并通过大量的实验对比验证了所提出方法的优越性.
  • 图  1  基于Spark的两层索引框架

    Fig.  1  two-level index framework based on Spark

    图  2  RI$^+$索引结构的工作流

    Fig.  2  Workflow of the RI$^+$ index

    图  3  查询字符串数目的影响(OSM数据集)

    Fig.  3  Effect of query keywords (OSM Dataset)

    图  4  查询范围的影响(TX-CA数据集)

    Fig.  4  Effect of the query area (TX-CA Dataset)

    图  5  数据集大小的影响(OSM数据集)

    Fig.  5  Effect of data size (OSM Dataset)

    图  6  全局索引的有效性(TX-CA数据集)

    Fig.  6  Effect of the global index (TX-CA Dataset)

    图  7  文本相似度阈值的影响(OSM数据集)

    Fig.  7  Effect of textual similarity thresh-old (OSM Dataset)

    图  8  空间距离阈值的影响(OSM数据集)

    Fig.  8  Effect of spatial distance thresh-old (OSM Dataset)

    算法1: STJoin
    输入:空间距离阈值$\epsilon$, 文本Jaccard相似度阈值$\lambda$, 数据分区$\mathcal{R}_i$对应的Prefix-RI索引根节点$r$, 数据分区$\mathcal{S}_j$中的一个空间文本对象$o_s=(\rho_s, \gamma_s)$
    输出:$\mathcal{R}_i$中可以和$o_s$进行连接的空间文本对象集合
    1: $prefixLen$=$|\gamma_s|-\lceil \lambda \times |\gamma_s| \rceil+1$;
    2: $S$=ø, $R$=ø;
    3: $S$.push($r$);
    4: while $S\neq$ ø do
    5:     $n$=$S$.pop();
    6:     If $n$ is a non-leaf node then
    7:          $bloomFilter$=$n.bloomFilter$;
    8:          $flag$=False;
    9:          for $i$ from 0 to prefixLen-1 do
    10:             if $bloomFilter$ contains $\gamma_s[i]$ then
    11:                  $flag$=True;
    12:                  break;
    13:          if $flag$ ==True then
    14:               for each entry $e_i$ of $n$ do
    15:                    if $dist(e_i.mbr, \rho_s) \leq \epsilon$ then
    16:                         $S$.push($e_i$.node);
     
    17: else
    18:     $inverted$=$n.invertedList$;
    19:     $candidates$=ø;
    20:     for $i$ from 0 to prefixLen -1 do
    21:          foreach object $o$ in $inverted.get(\gamma_s[i])$ do
    22:               if $dist(\rho_s, o.\rho) \leq \epsilon$ and $\lambda \times |o.\gamma| \leq |\gamma_s|$ and$\lambda\times |\gamma_s| \leq |o.\gamma|$ then
    23:                    $candidates$.add($o$);
    24:     $R$++=verify($candidates$);
    25: return $R$;
    下载: 导出CSV

    表  1  分布式的空间文本相似连接方案描述

    Tab.  1  Distributed spatio-textual similarity join algorithms

    方案简称 方案描述
    Spatial-first 基于5.2.1小节中的第一种分布式空间连接方案产生局部分区对.然后, 对每个局部分区对, 利用空间优先的空间文本连接方案对其进行处理
    Textual-first 基于5.2.1小节中的第一种分布式空间连接方案产生局部分区对.然后, 对每个局部分区对, 利用文本优先的空间文本连接方案对其进行处理
    Hybrid 基于5.2.1小节中的第一种分布式空间连接方案产生局部分区对.然后, 对每个局部分区对, 利用基于Prefix-RI索引的空间文本连接方案对其进行处理
    Allpairs-Spatial-first 基于5.2.1小节中的第二种分布式空间连接方案产生局部分区对.然后, 对每个局部分区对, 利用空间优先的空间文本连接方案对其进行处理
    Allpairs-Textual-first 基于5.2.1小节中的第二种分布式空间连接方案产生局部分区对.然后, 对每个局部分区对, 利用文本优先的空间文本连接方案对其进行处理
    Allpairs-Hybrid 基于5.2.1小节中的第二种分布式空间连接方案产生局部分区对.然后, 对每个局部分区对, 利用基于Prefix-RI索引的空间文本连接方案对其进行处理
    下载: 导出CSV
  • [1] CHEN L, CONG G, JENSEN C S, et al. Spatial keyword query processing: An experimental evaluation[C]//International Conference on Very Large Data Bases. Trondheim, Norway: VLDB Endowment, 2013, 6(3): 217-228. http://www.vldb.org/pvldb/vol6/p217-chen.pdf
    [2] BOUROS P, GE S, MAMOULIS N. Spatio-textual similarity joins[J]. Proceedings of the VLDB Endowment, 2012, 6(1):1-12. doi:  10.14778/2428536
    [3] ZAHARIA M, CHOWDHURY M, DAS T, et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing[C]//USENIX Association Usenix Conference on Networked Systems Design and Implementation. New York: ACM, 2012, 70(2): 2. http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
    [4] XIE D, LI F, YAO B, et al. Simba: Effcient in-memory spatial analytics[C]//International Conference on Management of Data. New York: ACM, 2016: 1071-1085. https://www.cs.utah.edu/~lifeifei/papers/simba.pdf
    [5] ELDAWY A, MOKBEL M F. Spatial Hadoop: A MapReduce framework for spatial data[C]//International Conference on Data Engineering. New York: IEEE, 2016: 1352-1363. https://www.microsoft.com/en-us/research/video/spatialhadoop-mapreduce-framework-big-spatial-data/
    [6] XIE D, LI F, YAO B, et al. Simba: Spatial in-memory big data analysis[C]//ACM Sigspatial International Conference on Advances in Geographic Information Systems. New York: ACM, 2016: 86-89. https://www.cs.utah.edu/~lifeifei/papers/simba-demo.pdf
    [7] YAO B, ZHANG W, WANG Z J, et al. Distributed in-memory analytics for big temporal data[C]//International Conference on Database Systems for Advanced Applications. Berlin: Springer, 2018: 549-565. doi:  10.1007/978-3-319-91452-7_36
    [8] AJI A, WANG F, VO H, et al. Hadoop-GIS:A high performance spatial data warehousing system over MapReduce[J]. Proceedings of the VLDB Endowment, 2013, 6(11):1009-1020. doi:  10.14778/2536222
    [9] YAO B, LI F, HADJIELEFTHERIOU M, et al. Approximate string search in spatial databases[C]//International Conference on Data Engineering. New York: IEEE, 2010: 545-556. http://www.cs.sjtu.edu.cn/~yaobin/papers/icde10_sas.pdf
    [10] YAO B, TANG M, LI F. Multi-approximate-keyword routing in GIS data[C]//ACM Sigspatial International Conference on Advances in Geographic Information Systems. New York: ACM, 2011: 201-210. http://www.cs.sjtu.edu.cn/~yaobin/makr/
    [11] LI F, YAO B, TANG M, et al. Spatial approximate string search[J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(6):1394-1409. doi:  10.1109/TKDE.2012.48
    [12] ZHOU Y, XIE X, WANG C, et al. Hybrid index structures for location-based web search[C]//International Conference on Information and Knowledge Management. New York: ACM, 2005: 155-162. https://dl.acm.org/citation.cfm?id=1099584
    [13] BECKMANN N, KRIEGEL H P, SCHNEIDER R, et al. The R*-tree:An effcient and robust access method for points and rectangles[J]. ACM Sigmod Record, 1990, 19(2):322-331. doi:  10.1145/93605
    [14] GUTTMAN A. R-trees: A dynamic index structure for spatial searching[C]//International Conference on Management of Data. New York: ACM, 1984: 47-57.
    [15] WU D, JENSEN C S. A density-based approach to the retrieval of top-k spatial textual clusters[C]//International Conference on Information and Knowledge Management. New York: ACM, 2016: 2095-2100. https://www.researchgate.net/publication/310823546_A_Density-Based_Approach_to_the_Retrieval_of_Top-K_Spatial_Textual_Clusters
    [16] CHOUDHURY F M, CULPEPPER J S, SELLIS T. Batch processing of top-k spatial-textual queries[C]//International ACM Workshop on Managing and Mining Enriched Geo-Spatial Data. New York: ACM, 2015: 7-12. http://www.culpepper.io/publications/ccs15-georich.pdf
    [17] LUO C, LI J, LI G, et al. Effcient reverse spatial and textual k nearest neighbor queries on road networks[J].Knowledge-Based Systems, 2016, 93:121-134. doi:  10.1016/j.knosys.2015.11.009
    [18] CHEN Z, CONG G, ZHANG Z, et al. Distributed publish/subscribe query processing on the spatio-textual data stream[C]//International Conference on Data Engineering. New York: IEEE, 2017: 1095-1106.
    [19] YAO B, LI F, KUMAR P. K nearest neighbor queries and knn-joins in large relational databases (almost) for free[C]//International Conference on Data Engineering. New York: IEEE, 2010: 4-15. https://www.cs.utah.edu/~lifeifei/papers/icde10_knn.pdf
    [20] 徐石磊, 王雷, 胡卉芪.基于分布式系统OceanBase的并行连接[J].华东师范大学学报(自然科学版), 2017(5):1-10. doi:  10.3969/j.issn.1000-5641.2017.05.001
    [21] SARAWAGI S, KIRPAL A. Effcient set joins on similarity predicates[C]//International Conference on Management of Data. New York: ACM, 2004: 743-754. http://www.cs.cmu.edu/~natassa/courses/15-721/resources/set_join.pdf
    [22] CHAUDHURI S, GANTI V, KAUSHIK R. A primitive operator for similarity joins in data cleaning[C]//International Conference on Data Engineering. New York: IEEE, 2006: 5. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ssjoin.pdf
    [23] BAYARDO R J, MA Y, SRIKANT R. Scaling up all pairs similarity search[C]//International Conference on World Wide Web. New York: ACM, 2007: 131-140. https://wenku.baidu.com/view/027d1ad084254b35eefd3471.html
    [24] FAN L, CAO P, ALMEIDA J, et al. Summary cache:A scalable wide-area web cache sharing protocol[J].IEEE/ACM Transactions on Networking, 2000, 8(3):281-293. doi:  10.1109/90.851975
    [25] XU Y, YAO B, WANG Z J, et al. Skia: Scalable and effcient in-memory analytics for big spatialtextualdata[EB/OL]. (2018-05-15)[2018-06-18]. https://www.researchgate.net/publication/326352693.
    [26] LEUTENEGGER S T, LOPEZ M A, EDGINGTON J. STR: a simple and effcient algorithm for R-tree packing[C]//International Conference on Data Engineering. New York: IEEE, 1997: 497-506.
    [27] OpenStreetMap Foundation. Openstreetmap project[EB/OL]. (2016-02-12)[2018-06-18]. http://www.openstreetmap.org.
    [28] LESKOVEC J, KREYL A. SNAP Datasets: Stanford large network dataset collection[EB/OL]. (2014-06-01)[2018-06-18]. http://snap.stanford.edu/data.
    [29] SAHAMI M, HEILMAN T D. A web-based kernel function for measuring the similarity of short text snippets[C]//International Conference on World Wide Web. New York: ACM, 2006: 377-386.
    [30] YAO B, CHEN Z, GAO X, et al. Flexible aggregate nearest neighbor queries in road networks[C]//International Conference on Data Engineering. New York: IEEE, 2018: 1-12. http://mashuai.buaa.edu.cn/pubs/icde2018b.pdf
    [31] MA J, YAO B, GAO X, et al. Top-k Critical Vertices Query on Shortest Path[J]. IEEE Transactions on Knowledge and Data Engineering, 2018, 99(1):1-13. http://www.computer.org/csdl/trans/tk/preprint/08300661-abs.html
    [32] XIE D, LI G, YAO B, et al. Practical private shortest path computation based on oblivious storage[C]//International Conference on Data Engineering. New York: IEEE, 2016: 361-372. http://www.cs.utah.edu/~dongx/paper/psp-icde.pdf
    [33] YAO B, XIAO X, LI F, et al. Dynamic monitoring of optimal locations in road network databases[J]. The International Journal on Very Large Data Bases, 2014, 23(5):697-720. doi:  10.1007/s00778-013-0347-5
    [34] XIAO X, YAO B, LI F. Optimal location queries in road network databases[C]//International Conference on Data Engineering. New York: IEEE, 2011: 804-815. http://www.cs.sjtu.edu.cn/~yaobin/papers/olq.pdf
  • 加载中
图(8) / 表(2)
计量
  • 文章访问数:  136
  • HTML全文浏览量:  37
  • PDF下载量:  213
  • 被引次数: 0
出版历程
  • 收稿日期:  2018-07-09
  • 刊出日期:  2018-09-25

目录

    /

    返回文章
    返回