基于分布式平台Spark的空间文本查询分析

徐阳; 王志杰; 钱诗友

doi:10.3969/j.issn.1000-5641.2018.05.010

基于分布式平台Spark的空间文本查询分析

doi: 10.3969/j.issn.1000-5641.2018.05.010

徐阳^1,,
王志杰^2, ,,
钱诗友¹

1.
上海交通大学计算机科学与工程系, 上海 200240
2.
中山大学数据科学与计算机学院, 广州 510006

基金项目:

国家重点研发计划项目 2017YFC0803700

广东省科技计划项目 2015A030401057

广东省科技计划项目 2016B030307002

详细信息

作者简介:
徐阳, 男, 硕士研究生, 研究方向为分布式计算、大数据处理.E-mail:xuyangit@sjtu.edu.cn

通讯作者:
王志杰, 男, 博士, 副研究员, 研究方向为数据挖掘等.E-mail:wangzhij5@mail.sysu.edu.cn

中图分类号: TP311
计量
- 文章访问数: 136
- HTML全文浏览量: 37
- PDF下载量: 213
- 被引次数: 0
出版历程
- 收稿日期: 2018-07-09
- 刊出日期: 2018-09-25

Distributed spatio-textual analytics based on the Spark platform

1.
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
2.
School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China

摘要

摘要: 随着基于位置服务应用的不断推广，空间文本数据查询的应用价值（例如结合地理位置和用户标签的社交推荐）也在不断提高.但是，随着数据规模的迅速增长，传统的基于单机环境实现的技术难以为用户提供低延时和高吞吐量的服务.为此，本文基于Spark平台对分布式环境下的空间文本查询算法进行了探究.采用了面向海量空间文本数据的两层索引框架（包括全局索引和局部索引），该框架利用了分阶段过滤的策略来处理分布式下的布尔范围查询问题.同时，针对空间文本相似连接提出了Prefix-RI结构并提出了相应的分布式算法.基于Spark平台实现了所提出的分布式算法，并通过大量的实验对比验证了所提出方法的优越性.
- 分布式计算 /
- 空间文本分析 /
- 相似连接
Abstract: With the rapid development of location-based services, spatio-textual data analytics is becoming increasingly important. For instance, it is widely used in social recommendation applications. However, performing efficient analysis on large spatio-textual datasets in a central environment remains a big challenge. This paper explored distributed algorithms for spatio-textual analytics based on the Spark platform. Speciffically, we proposed a scalable two-level index framework, which processes spatio-textual queries in two steps. The global index is highly scalable and it can retrieve candidate partitions with only a few false positives. The local index is designed based on pruning ability of infrequent keywords and used for each candidate partition. We implemented the proposed distributed algorithms in Spark. Extensive experiments demonstrated promising performance for the proposed solution.
- distributed processing /
- spatio-textual analytics /
- similarity join

HTML全文

图 1 基于Spark的两层索引框架

Fig. 1 two-level index framework based on Spark

下载: 全尺寸图片幻灯片

图 2 RI$^+$索引结构的工作流

Fig. 2 Workflow of the RI$^+$ index

下载: 全尺寸图片幻灯片

图 3 查询字符串数目的影响(OSM数据集)

Fig. 3 Effect of query keywords (OSM Dataset)

下载: 全尺寸图片幻灯片

图 4 查询范围的影响(TX-CA数据集)

Fig. 4 Effect of the query area (TX-CA Dataset)

下载: 全尺寸图片幻灯片

图 5 数据集大小的影响(OSM数据集)

Fig. 5 Effect of data size (OSM Dataset)

下载: 全尺寸图片幻灯片

图 6 全局索引的有效性(TX-CA数据集)

Fig. 6 Effect of the global index (TX-CA Dataset)

下载: 全尺寸图片幻灯片

图 7 文本相似度阈值的影响(OSM数据集)

Fig. 7 Effect of textual similarity thresh-old (OSM Dataset)

下载: 全尺寸图片幻灯片

图 8 空间距离阈值的影响(OSM数据集)

Fig. 8 Effect of spatial distance thresh-old (OSM Dataset)

下载: 全尺寸图片幻灯片

算法1: STJoin
输入:空间距离阈值$\epsilon$, 文本Jaccard相似度阈值$\lambda$, 数据分区$\mathcal{R}_i$对应的Prefix-RI索引根节点$r$, 数据分区$\mathcal{S}_j$中的一个空间文本对象$o_s=(\rho_s, \gamma_s)$
输出:$\mathcal{R}_i$中可以和$o_s$进行连接的空间文本对象集合
1: $prefixLen$=$\|\gamma_s\|-\lceil \lambda \times \|\gamma_s\| \rceil+1$;
2: $S$=ø, $R$=ø;
3: $S$.push($r$);
4: while $S\neq$ ø do
5: $n$=$S$.pop();
6: If $n$ is a non-leaf node then
7: $bloomFilter$=$n.bloomFilter$;
8: $flag$=False;
9: for $i$ from 0 to prefixLen-1 do
10: if $bloomFilter$ contains $\gamma_s[i]$ then
11: $flag$=True;
12: break;
13: if $flag$ ==True then
14: for each entry $e_i$ of $n$ do
15: if $dist(e_i.mbr, \rho_s) \leq \epsilon$ then
16: $S$.push($e_i$.node);

17: else
18: $inverted$=$n.invertedList$;
19: $candidates$=ø;
20: for $i$ from 0 to prefixLen -1 do
21: foreach object $o$ in $inverted.get(\gamma_s[i])$ do
22: if $dist(\rho_s, o.\rho) \leq \epsilon$ and $\lambda \times \|o.\gamma\| \leq \|\gamma_s\|$ and$\lambda\times \|\gamma_s\| \leq \|o.\gamma\|$ then
23: $candidates$.add($o$);
24: $R$++=verify($candidates$);
25: return $R$;

下载: 导出CSV

表 1 分布式的空间文本相似连接方案描述

Tab. 1 Distributed spatio-textual similarity join algorithms

方案简称	方案描述
Spatial-first	基于5.2.1小节中的第一种分布式空间连接方案产生局部分区对.然后, 对每个局部分区对, 利用空间优先的空间文本连接方案对其进行处理
Textual-first	基于5.2.1小节中的第一种分布式空间连接方案产生局部分区对.然后, 对每个局部分区对, 利用文本优先的空间文本连接方案对其进行处理
Hybrid	基于5.2.1小节中的第一种分布式空间连接方案产生局部分区对.然后, 对每个局部分区对, 利用基于Prefix-RI索引的空间文本连接方案对其进行处理
Allpairs-Spatial-first	基于5.2.1小节中的第二种分布式空间连接方案产生局部分区对.然后, 对每个局部分区对, 利用空间优先的空间文本连接方案对其进行处理
Allpairs-Textual-first	基于5.2.1小节中的第二种分布式空间连接方案产生局部分区对.然后, 对每个局部分区对, 利用文本优先的空间文本连接方案对其进行处理
Allpairs-Hybrid	基于5.2.1小节中的第二种分布式空间连接方案产生局部分区对.然后, 对每个局部分区对, 利用基于Prefix-RI索引的空间文本连接方案对其进行处理

下载: 导出CSV

参考文献(34)

[1]	CHEN L, CONG G, JENSEN C S, et al. Spatial keyword query processing: An experimental evaluation[C]//International Conference on Very Large Data Bases. Trondheim, Norway: VLDB Endowment, 2013, 6(3): 217-228. http://www.vldb.org/pvldb/vol6/p217-chen.pdf
[2]	BOUROS P, GE S, MAMOULIS N. Spatio-textual similarity joins[J]. Proceedings of the VLDB Endowment, 2012, 6(1):1-12. doi: 10.14778/2428536
[3]	ZAHARIA M, CHOWDHURY M, DAS T, et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing[C]//USENIX Association Usenix Conference on Networked Systems Design and Implementation. New York: ACM, 2012, 70(2): 2. http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
[4]	XIE D, LI F, YAO B, et al. Simba: Effcient in-memory spatial analytics[C]//International Conference on Management of Data. New York: ACM, 2016: 1071-1085. https://www.cs.utah.edu/~lifeifei/papers/simba.pdf
[5]	ELDAWY A, MOKBEL M F. Spatial Hadoop: A MapReduce framework for spatial data[C]//International Conference on Data Engineering. New York: IEEE, 2016: 1352-1363. https://www.microsoft.com/en-us/research/video/spatialhadoop-mapreduce-framework-big-spatial-data/
[6]	XIE D, LI F, YAO B, et al. Simba: Spatial in-memory big data analysis[C]//ACM Sigspatial International Conference on Advances in Geographic Information Systems. New York: ACM, 2016: 86-89. https://www.cs.utah.edu/~lifeifei/papers/simba-demo.pdf
[7]	YAO B, ZHANG W, WANG Z J, et al. Distributed in-memory analytics for big temporal data[C]//International Conference on Database Systems for Advanced Applications. Berlin: Springer, 2018: 549-565. doi: 10.1007/978-3-319-91452-7_36
[8]	AJI A, WANG F, VO H, et al. Hadoop-GIS:A high performance spatial data warehousing system over MapReduce[J]. Proceedings of the VLDB Endowment, 2013, 6(11):1009-1020. doi: 10.14778/2536222
[9]	YAO B, LI F, HADJIELEFTHERIOU M, et al. Approximate string search in spatial databases[C]//International Conference on Data Engineering. New York: IEEE, 2010: 545-556. http://www.cs.sjtu.edu.cn/~yaobin/papers/icde10_sas.pdf
[10]	YAO B, TANG M, LI F. Multi-approximate-keyword routing in GIS data[C]//ACM Sigspatial International Conference on Advances in Geographic Information Systems. New York: ACM, 2011: 201-210. http://www.cs.sjtu.edu.cn/~yaobin/makr/
[11]	LI F, YAO B, TANG M, et al. Spatial approximate string search[J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(6):1394-1409. doi: 10.1109/TKDE.2012.48
[12]	ZHOU Y, XIE X, WANG C, et al. Hybrid index structures for location-based web search[C]//International Conference on Information and Knowledge Management. New York: ACM, 2005: 155-162. https://dl.acm.org/citation.cfm?id=1099584
[13]	BECKMANN N, KRIEGEL H P, SCHNEIDER R, et al. The R*-tree:An effcient and robust access method for points and rectangles[J]. ACM Sigmod Record, 1990, 19(2):322-331. doi: 10.1145/93605
[14]	GUTTMAN A. R-trees: A dynamic index structure for spatial searching[C]//International Conference on Management of Data. New York: ACM, 1984: 47-57.
[15]	WU D, JENSEN C S. A density-based approach to the retrieval of top-k spatial textual clusters[C]//International Conference on Information and Knowledge Management. New York: ACM, 2016: 2095-2100. https://www.researchgate.net/publication/310823546_A_Density-Based_Approach_to_the_Retrieval_of_Top-K_Spatial_Textual_Clusters
[16]	CHOUDHURY F M, CULPEPPER J S, SELLIS T. Batch processing of top-k spatial-textual queries[C]//International ACM Workshop on Managing and Mining Enriched Geo-Spatial Data. New York: ACM, 2015: 7-12. http://www.culpepper.io/publications/ccs15-georich.pdf
[17]	LUO C, LI J, LI G, et al. Effcient reverse spatial and textual k nearest neighbor queries on road networks[J].Knowledge-Based Systems, 2016, 93:121-134. doi: 10.1016/j.knosys.2015.11.009
[18]	CHEN Z, CONG G, ZHANG Z, et al. Distributed publish/subscribe query processing on the spatio-textual data stream[C]//International Conference on Data Engineering. New York: IEEE, 2017: 1095-1106.
[19]	YAO B, LI F, KUMAR P. K nearest neighbor queries and knn-joins in large relational databases (almost) for free[C]//International Conference on Data Engineering. New York: IEEE, 2010: 4-15. https://www.cs.utah.edu/~lifeifei/papers/icde10_knn.pdf
[20]	徐石磊, 王雷, 胡卉芪.基于分布式系统OceanBase的并行连接[J].华东师范大学学报(自然科学版), 2017(5):1-10. doi: 10.3969/j.issn.1000-5641.2017.05.001
[21]	SARAWAGI S, KIRPAL A. Effcient set joins on similarity predicates[C]//International Conference on Management of Data. New York: ACM, 2004: 743-754. http://www.cs.cmu.edu/~natassa/courses/15-721/resources/set_join.pdf
[22]	CHAUDHURI S, GANTI V, KAUSHIK R. A primitive operator for similarity joins in data cleaning[C]//International Conference on Data Engineering. New York: IEEE, 2006: 5. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ssjoin.pdf
[23]	BAYARDO R J, MA Y, SRIKANT R. Scaling up all pairs similarity search[C]//International Conference on World Wide Web. New York: ACM, 2007: 131-140. https://wenku.baidu.com/view/027d1ad084254b35eefd3471.html
[24]	FAN L, CAO P, ALMEIDA J, et al. Summary cache:A scalable wide-area web cache sharing protocol[J].IEEE/ACM Transactions on Networking, 2000, 8(3):281-293. doi: 10.1109/90.851975
[25]	XU Y, YAO B, WANG Z J, et al. Skia: Scalable and effcient in-memory analytics for big spatialtextualdata[EB/OL]. (2018-05-15)[2018-06-18]. https://www.researchgate.net/publication/326352693.
[26]	LEUTENEGGER S T, LOPEZ M A, EDGINGTON J. STR: a simple and effcient algorithm for R-tree packing[C]//International Conference on Data Engineering. New York: IEEE, 1997: 497-506.
[27]	OpenStreetMap Foundation. Openstreetmap project[EB/OL]. (2016-02-12)[2018-06-18]. http://www.openstreetmap.org.
[28]	LESKOVEC J, KREYL A. SNAP Datasets: Stanford large network dataset collection[EB/OL]. (2014-06-01)[2018-06-18]. http://snap.stanford.edu/data.
[29]	SAHAMI M, HEILMAN T D. A web-based kernel function for measuring the similarity of short text snippets[C]//International Conference on World Wide Web. New York: ACM, 2006: 377-386.
[30]	YAO B, CHEN Z, GAO X, et al. Flexible aggregate nearest neighbor queries in road networks[C]//International Conference on Data Engineering. New York: IEEE, 2018: 1-12. http://mashuai.buaa.edu.cn/pubs/icde2018b.pdf
[31]	MA J, YAO B, GAO X, et al. Top-k Critical Vertices Query on Shortest Path[J]. IEEE Transactions on Knowledge and Data Engineering, 2018, 99(1):1-13. http://www.computer.org/csdl/trans/tk/preprint/08300661-abs.html
[32]	XIE D, LI G, YAO B, et al. Practical private shortest path computation based on oblivious storage[C]//International Conference on Data Engineering. New York: IEEE, 2016: 361-372. http://www.cs.utah.edu/~dongx/paper/psp-icde.pdf
[33]	YAO B, XIAO X, LI F, et al. Dynamic monitoring of optimal locations in road network databases[J]. The International Journal on Very Large Data Bases, 2014, 23(5):697-720. doi: 10.1007/s00778-013-0347-5
[34]	XIAO X, YAO B, LI F. Optimal location queries in road network databases[C]//International Conference on Data Engineering. New York: IEEE, 2011: 804-815. http://www.cs.sjtu.edu.cn/~yaobin/papers/olq.pdf

施引文献

资源附件(0)

访问统计

点击查看大图

图(8) / 表(2)

计量

文章访问数: 136
HTML全文浏览量: 37
PDF下载量: 213
被引次数: 0

姓名
邮箱
手机号码
标题
留言内容
验证码

留言板

基于分布式平台Spark的空间文本查询分析

doi: 10.3969/j.issn.1000-5641.2018.05.010

作者简介:
徐阳, 男, 硕士研究生, 研究方向为分布式计算、大数据处理.E-mail:xuyangit@sjtu.edu.cn

通讯作者:
王志杰, 男, 博士, 副研究员, 研究方向为数据挖掘等.E-mail:wangzhij5@mail.sysu.edu.cn

计量

Distributed spatio-textual analytics based on the Spark platform

计量

目录

留言板

基于分布式平台Spark的空间文本查询分析

doi: 10.3969/j.issn.1000-5641.2018.05.010

作者简介: 徐阳, 男, 硕士研究生, 研究方向为分布式计算、大数据处理.E-mail:xuyangit@sjtu.edu.cn

通讯作者: 王志杰, 男, 博士, 副研究员, 研究方向为数据挖掘等.E-mail:wangzhij5@mail.sysu.edu.cn

计量

出版历程

Distributed spatio-textual analytics based on the Spark platform

计量

出版历程

目录

作者简介:
徐阳, 男, 硕士研究生, 研究方向为分布式计算、大数据处理.E-mail:xuyangit@sjtu.edu.cn

通讯作者:
王志杰, 男, 博士, 副研究员, 研究方向为数据挖掘等.E-mail:wangzhij5@mail.sysu.edu.cn