Algorithm for mining approximate frequent subgraphs in a single graph
-
摘要: 图数据的挖掘工作是数据挖掘工作中的重要组成部分,已经有许多人在这个领域进行了深入的研究.由于数据获取不可避免噪音数据,故在挖掘频繁图时考虑近似十分重要.然而许多此前的工作只考虑了子图间编辑距离(Graph Edit Distance,GED)的绝对值,而没有考虑子图间编辑距离与子图大小的相对关系.提出了一种在单图中进行近似频繁子图挖掘的新算法,并在计算近似程度时考虑当前子图的大小.该算法通过对近似频繁子图的大小上限进行预测,并通过局部反单调性进行剪枝,提高了算法的效率.实验表明,该算法能够挖掘出传统算法无法发现的近似频繁子图,且相比对比算法具有更好的时间性能.Abstract: Graph mining is an important part of data mining, and significant research has been dedicated to the field. Due to the inevitability of noise during data acquisition, it is crucial to address the issue of approximation in mining frequent subgraphs. Previous work has only considered the absolute value of the graph edit distance (GED); however, the relative value between the GED and the size of the subgraph should also be considered. Hence, in this paper, we propose a novel algorithm to mine approximate frequent subgraphs in a single graph; this algortihm takes into account the size of current subgraphs for the calculation of approximations. We increase the efficiency of this algorithm by estimating the upper bound of frequent subgraphs, and pruning according to local anti-monotonicity. Experimental results show that this algorithm can find subgraphs that are missed by traditional mining algorithms, and our proposed approach is relatively efficent compared to other algorithms.
-
Key words:
- approximate /
- graph /
- frequent subgraph mining /
- pruning
-
Algorithm 1 SimilarGraphGene Input: $V_g, E_g, T, d= |g|\times(1-\theta)$ Output: All Similar Graph of g 1: $S\leftarrow V_g, E_g, $ 2:for each vertex or edge $x$ in $S$ do 3: Deleted $\leftarrow x$ 4: if $x$ is a vertex then 5: Deleted $\leftarrow$ every edge connected to $x$ 6: end if 7: $m$=isconnected(Deleted, $V_g$, $E_g$) 8: if $m$=true then 9: $k= |T|$ 10: $V_{g'} = V_g$ Deleted$_ V$ 11: $E_{g' }= E_g$ Deleted$_ E$ 12: if $|$Deleted$|$ $ < d$ then 13: SimilarGraphGene($V_{g'}$, $E_{g'}$, $T, d-|$Deleted$|$) 14: if $k= |T|$ then 15: $T\leftarrow g$ 16: end if 17: end if 18: if $|$Deleted$| = d$ then 19: $T\leftarrow g'$ 20: end if 21: end if 22: undo line 3-5 23:end for 24:return $T$ 表 1 样本图上挖掘的时间消耗对比
Tab. 1 Comparison of time required for mining of sample graph
算法 $t$/ms $G7$ $G9$ $G11$ $G13$ 本文算法 53 554 760 1 347 AGraP 1 041 10 226 346 918 46 301 253 表 2 发现子图能力结果
Tab. 2 Experimental results showing the ability to discover subgraphs
图 近似匹配 本文算法 对比算法 点0, 4, 7 点0, 4, 7 点1, 3, 8 点1, 3, 8 点2, 6, 10 点2, 6, 10 (0, 1), (3-4), (7, 8) (0, 1), (3-4), (7, 8) (1, 0, 2), (4, 3), (8, 7, 10) (1, 0, 2), (8, 7, 10) -
[1] YAN X F, HAN J W. gSpan: Graph-based substructure pattern mining[C]//Proceedings of the 2002 IEEE International Conference on Data Mining. 2002: 721-724. [2] ELSEIDY M, ABDELHAMID E, SKIADOPOULOS S, et al. GraMi:Frequent subgraph and pattern mining in a single large graph[J]. Proceedings of the VLDB Endowment, 2014, 7:517-528. doi: 10.14778/2732286.2732289 [3] CHEN C, YAN X F, ZHU F D, et al. gApprox: Mining frequent approximate patterns from a massive network[C]//7th IEEE International Conference on Data Mining. 2007: 445-450. [4] FLORES-GARRIDO M, CARRASCO-OCHOA J A, MARTÍNEZ-TRINIDAD J F. AGraP:An algorithm for mining frequent patterns in a single graph using inexact matching[J]. Knowledge and Information Systems, 2015, 2(44):385-406. http://d.old.wanfangdata.com.cn/NSTLQK/NSTL_QKJJ0234951655/ [5] KURAMOCHI M, KARYPIS G. Finding frequent patterns in a large sparse graph[J]. Data Mining and Knowledge Discovery, 2005, 11(3):243-271. doi: 10.1007/s10618-005-0003-9 [6] NIJSSEN S, KOK J N. A quickstart in frequent structure mining can make a difference[C]//Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004: 647-652. [7] ALONSO A G, PAGOLA J E M, CARRASCO-OCHOA J A, et al. Mining frequent connected subgraphs reducing the number of candidates[C]//Joint European Conference on Machine Learning and Knowledge Discovery in Databases. 2008: 365-376. [8] RANU S, SINGH A K. Graphsig: A scalable approach to mining significant subgraphs in large graph databases[C]//2009 IEEE 25th International Conference on Data Engineering. 2009: 844-855. [9] CHENG H, YAN X F, HAN J W. Mining graph patterns[C]//Frequent Pattern Mining, Berlin: Springer, 2014: 307-338. [10] KRISHNA V, SURI N R, ATHITHAN G. A comparative survey of algorithms for frequent subgraph discovery[J]. Current Science, 2011, :190-198. http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=Open J-Gate000000074023 [11] CHOUDHURY S, PUROHIT S, LIN P, et al. Percolator: Scalable pattern discovery in dynamic graphs[C]//Proceedings of the 11th ACM International Conference on Web Search and Data Mining. 2018: 759-762. [12] INGALALLI V, IENCO D, PONCELET P. Mining frequent subgraphs in multigraphs[J]. Information Sciences, 2018, 451/452:50-66. doi: 10.1016/j.ins.2018.04.001 [13] ALGULIEV R M, ALIGULIYEV R M, GANJALIYEV F S. Extracting a heterogeneous social network of academic researchers on the Web based on information retrieved from multiple sources[J]. American Journal of Operations Research, 2011, 1(2):33. http://d.old.wanfangdata.com.cn/OAPaper/oai_doaj-articles_d1a43f69b12775f8e61ed6f03a30c453 [14] LIMA JR D P, GIACOMINI H C, TAKEMOTO R M, et al. Patterns of interactions of a large fish-parasite network in a tropical floodplain[J]. Journal of Animal Ecology, 2012, 81(4):905-913.. doi: 10.1111/j.1365-2656.2012.01967.x [15] GAO X B, XIAO B, TAO D C, et al. A survey of graph edit distance[J]. Pattern Analysis and Applications, 2010, 13(1):113-129. http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=23f1c6b1111f176881d00ede280bb4eb [16] HIDOVIĆ D, PELILLO M. Metrics for attributed graphs based on the maximal similarity common subgraph[J]. International Journal of Pattern Recognition and Artificial Intelligence, 2004, 18(3):299-313. http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=26e34f94bfb8adc8f71be64ac36fc57c [17] DEHMER M, EMMERT-STREIB F. Comparing large graphs efficiently by margins of feature vectors[J]. Applied Mathematics and Computation, 2007, 188(2):1699-1710. http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=bb5b4f11bfe4a7efbf2a637396f786e0 [18] HOLDER L B, COOK D J, DJOKO S. Substucture discovery in the SUBDUE system[C]//KDD Workshop, 1994: 169-180. [19] JIA Y, ZHANG J T, HUAN J. An efficient graph-mining method for complicated and noisy data with real-world applications[J]. Knowledge and Information Systems, 2011, 28(2):423-447. doi: 10.1007/s10115-010-0376-y [20] FLORES-GARRIDO M, CARRASCO-OCHOA J A, MARTÍNEZ-TRINIDAD J F. Extensions to AGraP algorithm for finding a reduced set of inexact graph patterns[J]. International Journal of Pattern Recognition and Artificial Intelligence, 2018, 32(1):1860012. http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=6062c7c25d7e6f97c25cd19da9333d5c [21] ACOSTA-MENDOZA N, GAGO-ALONSO A, CARRASCO-OCHOA J A, et al. Extension of canonical adjacency matrices for frequent approximate subgraph mining on multi-graph collections[J]. International Journal of Pattern Recognition and Artificial Intelligence, 2017, 31(8):1750025. doi: 10.1142/S0218001417500252 [22] ACOSTA-MENDOZA N, MORALES-GONZáLEZ A, GAGO-ALONSO A, et al. Image classification using frequent approximate subgraphs[C]//Iberoamerican Congress on Pattern Recognition. 2012: 292-299. [23] ACOSTA-MENDOZA N, CARRASCO-OCHOA J A, MARTÍNEZ-TRINIDAD J F, et al. Image clustering based on frequent approximate subgraph mining[C]//Mexican Conference on Pattern Recognition. 2018: 189-198.