Business circle population mobility statistics based on mobile trajectory data
-
Abstract: With the advance of urbanization and the continuous development of big data technology, smart business circles have become an important part of smart city construction. Their popularity, number of consumers, and consumption level have become focal points in building them. However, consumer counts are still obtained mainly through questionnaires or sampling surveys, which are costly and inefficient. Advances in data mining now make it possible to estimate the number of consumers in a business circle by analyzing user trajectory data. This paper proposes a trajectory-based method for analyzing the consumer scale of a business circle. The main contributions are as follows. ① Determining the boundary of a business circle in trajectory data is the first problem to solve, because only then can a consumer be judged to be active inside or outside the circle. We propose a method that delineates the business circle by applying the k-Nearest Neighbor (kNN) classification algorithm to the locations of the base stations within it. ② Because trajectory data are uncertain, determining the relationship between a user and a business circle is also difficult. We compute the weight of each base station with an irregular-polygon area calculation and combine it with time thresholds to estimate the daily consumer scale in the area. ③ Finally, given the massive volume of trajectory data, we propose a big data computing framework, BPDA (Business-Circle Parallel Distributed Algorithm), built on the Hadoop big data platform and the Kafka distributed messaging system, and use it to implement a consumer scale analysis system for business circles. Base station data from the Zhongshan Park business circle are used to demonstrate the feasibility of the proposed method.
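As a rough illustration of contribution ①, the Python sketch below applies the rule stated in step 13 of Algorithm 1 to a single candidate grid point: the point is kept as a contour point when business-circle stations hold the majority among its k nearest base stations, its two nearest stations belong to different classes, and their distances differ by less than a threshold. The function name, the planar Euclidean distance, and the toy coordinates are assumptions made for this example; real longitude/latitude data would need a geodesic distance, and this is not the authors' implementation.

```python
import math


def is_contour_point(point, stations, k, tau):
    """Return True if `point` looks like it lies on the business-circle edge.

    stations: list of (x, y, in_circle) tuples; in_circle is a bool label.
    """
    # Distance from the candidate point to every station, nearest first.
    ranked = sorted(
        (math.hypot(point[0] - x, point[1] - y), in_circle)
        for x, y, in_circle in stations
    )
    top_k = ranked[:k]
    if len(top_k) < 2:
        return False
    # kNN vote: business-circle stations must outnumber the others.
    votes_in = sum(1 for _, in_circle in top_k if in_circle)
    if votes_in <= len(top_k) - votes_in:
        return False
    # Edge test: the two nearest stations disagree and are almost equally far.
    (d1, label1), (d2, label2) = top_k[0], top_k[1]
    return label1 != label2 and (d2 - d1) < tau


# Toy data in arbitrary planar units (hypothetical, not from the paper).
STATIONS = [
    (0.0, 0.0, True), (1.0, 0.0, True), (0.0, 1.0, True),
    (3.0, 0.0, False), (4.0, 0.0, False),
]
CANDIDATES = [(0.5, 0.2), (1.9, 0.0), (3.5, 0.0)]
contour = [p for p in CANDIDATES if is_contour_point(p, STATIONS, k=3, tau=0.5)]
```

In this toy layout only the candidate lying between the two groups of stations passes the test; the point deep inside the circle fails the edge test and the point among the outside stations fails the vote.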
-
Algorithm 1 Business circle identification
Input: set of all base stations $S_1=\{s_1, s_2, \cdots, s_m\}$; maximum longitude lng$_{\max}$, minimum longitude lng$_{\min}$, maximum latitude lat$_{\max}$, and minimum latitude lat$_{\min}$ of the business circle; set of covering points $S_2=\{a_1, a_2, \cdots, a_m\}$; threshold $\tau$
Output: business circle contour set $S_3=\{\}$
1. lat$_{\max}\leftarrow$ Max$_{\rm lat}(S_1)$
2. lat$_{\min}\leftarrow$ Min$_{\rm lat}(S_1)$
3. lng$_{\max}\leftarrow$ Max$_{\rm lng}(S_1)$
4. lng$_{\min}\leftarrow$ Min$_{\rm lng}(S_1)$
5. For lng$_{\max}\sim$ lng$_{\min}$, divided into m equal parts
6.   For lat$_{\max}\sim$ lat$_{\min}$, divided into n equal parts
7.     $S_3\leftarrow p$(lng$_i$, lat$_i$)
8.   End for
9. End for
10. For i from 1:m
11.   $D_i(d_i, a_i, s_i)\leftarrow$ distance($s_i, a_i$)
12.   Sort the distance set $D_i$ in ascending order of $d_i$, Sort$_{\rm asc}(D_i)$, and take the top k
13.   If among the k records count$_{\rm business}>$ count$_{\rm non\text{-}business}$, and of $s_1$, $s_2$ one is a business-circle base station and the other is not, and $|d_2-d_1|<\tau$
14.     $S_3\leftarrow a_i$
15.   End if
16.   $S_2$ delete($a_i$)
17. End for
18. Return $S_3$

Algorithm 2 Business circle contour optimization
Input: business circle contour set $S_3=\{a_1, a_2, \cdots, a_m\}$; threshold $\tau_2$
Output: optimized business circle contour set $S_4$
1. $S_4\leftarrow a_1$, $c\leftarrow$ null
2. For i from 2:m
3.   If $a_i\leq c$
4.     If this is the first covering point taken from $S_3$
5.       $c\leftarrow a_i$
6.     End if
7.     $b\leftarrow a_{i+1}$
8.     If distance(c, b) $<\tau_2$
9.       $c\leftarrow b$
10.      $S_4\leftarrow b$
11.    End if
12.  Else
13.    Delete $b$ from $S_3$
14.  End if
15. End for
16. Return $S_4$

Algorithm 3 Base station weight calculation
Input: set of business circle base stations $S_5=\{s_1, s_2, \cdots, s_n\}$; set of random sample points $S_{\rm random}^i=\{p_1, p_2, \cdots, p_n\}$
Output: weighted set of business circle base stations $S_6=\{(s_1, w_1), (s_2, w_2), \cdots, (s_n, w_n)\}$
1. For i from 1:n
2.   If $s_i$ is a macro station
3.     $\bigcirc\leftarrow$ construct a circle of radius $r_1=2500$ m
4.     For j from 1:m
5.       $S_{\rm random}^j\leftarrow$ random($p_j$)
6.     End for
7.     Weight of base station $s_i$: $w_i\leftarrow$ (number of random points inside the business circle, count($S_{\rm random}^j$))/m
8.     $S_6\leftarrow(s_i, w_i)$
9.   End if
10.  Else
11.    $\bigcirc\leftarrow$ construct a circle of radius $r_2=500$ m
12.    For j from 1:m
13.      $S_{\rm random}^j\leftarrow$ random($p_j$)
14.    End for
15.    Weight of base station $s_i$: $w_i\leftarrow$ (number of random points inside the business circle, count($S_{\rm random}^j$))/m
16.    $S_6\leftarrow(s_i, w_i)$
17. End for
18. Return $S_6$

Algorithm 4 Business circle consumer scale statistics
Map part
Input: ordered mobile trajectory data set $S_7=\{({\rm sta}_1, t_1), ({\rm sta}_2, t_2), \cdots, ({\rm sta}_n, t_n)\}$; weighted set of business circle base stations $S_6=\{(s_1, w_1), (s_2, w_2), \cdots, (s_n, w_n)\}$
Output: consumer scale in each time slot $\langle t_i, 1\rangle$
1. For i from 1:n
2.   $a\leftarrow$ (sta$_i$, $t_i$)
3.   If $a$ is the first record read and sta$_i$ is a business circle base station
4.     $t_{\rm reserved}\leftarrow t_i$
5.     sta$_{\rm reserved}\leftarrow$ sta$_i$
6.     $\Delta t_i\leftarrow t-t_{\rm reserved}$
7.     $\Delta t_{\rm total}\leftarrow \Delta t_{\rm total}+\Delta t_{i}\times w_i$
8.     Continue
9.   End if
10.  If sta$_{\rm reserved}$ is a business circle base station
11.    If sta$_{i}$ is a business circle base station
12.      $\Delta t_i\leftarrow t-t_{\rm reserved}$
13.      $\Delta t_{\rm total}\leftarrow \Delta t_{\rm total}+\Delta t_{i}\times w_i$
14.    End if
15.    Else
16.      $\Delta t_i\leftarrow t-t_{\rm reserved}$
17.      If $\Delta t_{i}$ is less than the time threshold $\S_1$
18.        $\Delta t_{\rm total}\leftarrow \Delta t_{\rm total}+\Delta t_{i}\times w_i$
19.      End if
20.      Else
21.        If $\Delta t_{\rm total}$ is greater than the time threshold $\S_2$
22.          If the user does not live or work around the business circle
23.            timesplit(($t_{\rm begin}$, $t_{\rm end}$, $\Delta t_{\rm total}$), 1 h)
24.            Emit($\langle t_i, 1\rangle$)
25.          End if
26.        End if
27.      End if
28. End for
29. $t_{\rm reserved}\leftarrow t$
30. sta$_{\rm reserved}\leftarrow$ sta$_{i}$
Reduce part
Input: set of business circle consumers in the specified date and time slot $\langle t_i, [1, 1, 1, \cdots]\rangle$
Output: total number of business circle consumers in the specified date and time slot $\langle t_i, {\rm num}_i\rangle$
31. ${\rm num}_i\leftarrow$ count($\langle t_i, [1, 1, 1, \cdots]\rangle$)
32. Emit($\langle t_i, {\rm num}_i\rangle$)

OIDD data
Fields: user ID, date, time, base station ID, east longitude, north latitude
$\langle$1331180xxxx, 2014-09-15, 09:50:52, C0601, 121.4547222°, 31.41055556°$\rangle$

Base station data of the business circle
Fields: base station ID, east longitude, north latitude, type, business circle ID, location
$\langle$C1270, 121.24782°, 31.39236°, macro station, 8, xx Building, No. x, xx Road$\rangle$
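To make the link between Algorithm 3 and Algorithm 4 concrete, the following Python sketch first estimates a station weight as the share of random points in its coverage disc that fall inside the business circle, then uses those weights to accumulate a user's weighted dwell time and count users per hour. It is a single-process approximation under several assumptions that are not in the source: the helper names, the `inside` callback standing in for the contour test, integer Unix-second timestamps, weighting each interval by the station at its start, closing an open stay at the end of the track, and omitting the resident/worker filter of step 22. It is not the authors' Hadoop implementation.

```python
import math
import random
from collections import Counter


def station_weight(cx, cy, radius, inside, samples=20000, seed=0):
    """Algorithm 3 idea: share of a station's coverage disc inside the circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        r = radius * math.sqrt(rng.random())  # sqrt gives area-uniform points
        theta = rng.uniform(0.0, 2.0 * math.pi)
        if inside(cx + r * math.cos(theta), cy + r * math.sin(theta)):
            hits += 1
    return hits / samples


def visits(track, weights, gap_limit):
    """Yield (begin, end, weighted_seconds) for each stay of one user.

    track: time-ordered (station_id, unix_seconds) records.
    weights: station_id -> weight, for business-circle stations only.
    """
    begin = last = prev_sta = prev_t = None
    total = 0.0
    for sta, t in track:
        if begin is not None:
            if prev_sta in weights and (sta in weights or t - prev_t < gap_limit):
                # Still in the circle, or only a brief hand-off outside it.
                total += (t - prev_t) * weights[prev_sta]
                last = t
            elif t - last >= gap_limit:
                yield begin, last, total  # the user has left: close the stay
                begin = None
            elif sta in weights:
                last = t  # back inside before the gap limit expired
        if begin is None and sta in weights:
            begin, last, total = t, t, 0.0
        prev_sta, prev_t = sta, t
    if begin is not None:
        yield begin, last, total


def hourly_counts(tracks, weights, gap_limit, stay_limit):
    """Map and Reduce collapsed into one loop: hour index -> number of stays."""
    counts = Counter()
    for track in tracks:
        for begin, end, total in visits(track, weights, gap_limit):
            if total > stay_limit:
                counts.update(range(begin // 3600, end // 3600 + 1))
    return counts


def inside(x, y):
    """Toy business circle: the half-plane x >= 0 (hypothetical)."""
    return x >= 0


weights = {
    "A": station_weight(600.0, 0.0, 500.0, inside),  # disc entirely inside
    "B": station_weight(0.0, 0.0, 500.0, inside),    # disc straddles the edge
}
track = [("A", 0), ("A", 3600), ("B", 5400), ("X", 5700), ("X", 20000)]
counts = hourly_counts([track], weights, gap_limit=1200, stay_limit=3600)
```

The thresholds of 1200 s and 3600 s correspond to one of the $\S_1$/$\S_2$ settings reported in Tab. 3; the station IDs and the track are invented for the example.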
Tab. 1 Distribution of volunteers' consumption counts in the business circle
Consumption count    Number of people    Percentage/%
0                    63                  66
1                    21                  22
2-3                  8                   8
>3                   4                   4
Tab. 2 Distribution of the number of times volunteers passed through the business circle
Passing count        Number of people    Percentage/%
0                    28                  29
1-3                  39                  40
4-10                 23                  25
>10                  6                   6
Tab. 3 Performance of the BPDA algorithm
$\S_1$/s    $\S_2$/s    Precision/%    Recall/%    F1-score
1200        2400        83.8           86.1        0.849
1200        3600        87.9           83.9        0.858
1200        4200        88.7           78.3        0.831
1800        2400        82.4           86.5        0.844
1800        3600        87.8           84.3        0.859
1800        4200        88.1           79.8        0.837
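The F1-score column of Tab. 3 can be cross-checked against the two columns before it with the standard harmonic-mean formula $F_1 = 2PR/(P+R)$. The short Python check below is illustrative only; it assumes the third column is precision in the information-retrieval sense and copies the six rows from the table.

```python
# (threshold 1 in s, threshold 2 in s, precision %, recall %, reported F1)
ROWS = [
    (1200, 2400, 83.8, 86.1, 0.849),
    (1200, 3600, 87.9, 83.9, 0.858),
    (1200, 4200, 88.7, 78.3, 0.831),
    (1800, 2400, 82.4, 86.5, 0.844),
    (1800, 3600, 87.8, 84.3, 0.859),
    (1800, 4200, 88.1, 79.8, 0.837),
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    return 2 * precision * recall / (precision + recall)


# Largest gap between the recomputed and the reported F1 over all rows.
max_gap = max(abs(f1_score(p / 100, r / 100) - f1) for _, _, p, r, f1 in ROWS)
```

The recomputed values stay within about 0.002 of the reported ones, which is consistent with the precision and recall figures having been rounded to one decimal place before publication.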