A survey of entity matching algorithms in heterogeneous networks
-
摘要: 互联网、物联网和云计算技术的不断融合,使得各行各业信息化程度越来越高,但同时也带来了数据碎片化的问题.数据碎片化的海量性、异构性、隐私性、相依性和低质性等特征,导致了数据可用性较差,利用这些数据难以挖掘出准确而完整的信息.为了更有效地利用数据,实体匹配、融合和消歧变得尤为重要.主要对异构网络中实体匹配算法进行了综述,对实体相似度度量和数据预处理技术进行了梳理;特别针对海量数据,概述了可扩展实体匹配方法的研究进展,综述了运用监督学习和非监督学习两类技术的实体匹配算法.Abstract: The continuous integration of Internet, Internet of Things, and cloud computing technologies has been improving digitization across different industries, but it has also introduced increased data fragmentation. Data fragmentation is characterized by mass, heterogeneity, privacy, dependence, and low quality, resulting in poor data availability. As a result, it is often difficult to obtain accurate and complete information for many analytical tasks. To make effective use of data, entity matching, fusion, and disambiguation are of particular significance. In this paper, we summarize data preprocessing, similarity measurements, and entity matching algorithms of heterogeneous networks. In addition, particularly for large datasets, we investigate scalable entity matching algorithms. Existing entity matching algorithms can be categorized into two groups:supervised and unsupervised learning-based algorithms. We conclude the study with research progress on entity matching and topics for future research.
-
Key words:
- data fusion /
- entity matching /
- record linkage /
- entity resolution
-
表 1 数据预处理类型、方法及优缺点
Tab. 1 Types, methods, advantages, and disadvantages of data processing
数据
类型处理方法 优点 缺点 缺失
数据含有缺失值的数据记录直接删除 简单方便 会去除很多有用的数据,包括很多可能匹配的数据 数值型元素按该列平均值补全,字符型用null补全 数据质量较髙,应用范围最广 损失了一部分精度 聚类方法,基于其他特性判断该样本的整体情况,维持其整体情况不变并补全
比如:一个人在新浪微博和人人网上的好友量都有top3, 且在开心网上的好友数据缺失,那么使用平均值补全其数据是不合理的,因为此人在开心网上好友排名仍然会相对较髙。可以根据这个信息进行缺失值补全[10]数据质量最髙,数据内在特性得以保留 复杂度较高,效率低 噪声
数据聚类之后去除离群点 思想简单,易操作 会去除很多有用的数据 根据特定规则,基于统计,将少数出现的数据变成多数
比如:Modiano不是一些特殊数据且出现多次,Modianoy只出现一次,极有可能Modianoy是拼错的,将Modianoy换成Modiano[11]针对特定属性进行特别处理,得到数据质量较髙 不同的属性可能有不同的规则,比如姓氏Lin出现多次,Ling出现一次,但是不可能将Ling更改为Lin,所以替换场景比较难确定 表 2 参考文献数据库
Tab. 2 Reference database
记录 标题 作者 审稿单位 b1 Object Identification using CRFs Linda Stewart Proc. AAAI-05 b2 Object Identification using CRFs Linda Stewart Proc. 20th AAAI b3 Learning Boolean Formulas Bill Johnson Proc. AAAI-05 b4 Learning of Boolean Expressions William Johnson Proc. 20th AAAI 表 3 算法优缺点总结
Tab. 3 Advantages and disadvantages of different algorithms
类别 主要算法思想 涉及算法 优点 缺点 基于监督学习的实体匹配算法 使用数据集训练算法模型,新的实体对直接使用此模型得到匹配结果。 支持向量机[42]、logistic回归[44]、AdaBoostf[13]、梯度提升树以[27, 37]等 准确性相对较髙 若想得到较好的结果,需要对分类算法的参数进行训练,需要大量的有标签数据,这在实际中难以满足需求 基于规则的算法主要通过人为制定匹配规则。 文献[45, 47-49, 51] 思路简单,比较容易理解 人为定义规则,非常消耗人力且定义规则存在主观性;很难达到最优值。 基于非监督学习的实体匹配算法 基于聚类的算法思想是类别内的数据比较相似,类别之间的数据差异性比较大 K-Means法[29, 53]、层次聚类算法[54-56] 不需要标注数据,人力成本低 算法效果一般较差;簇的个数若人为指定,匹配效果很难最优;若采用自动确定方式,需要引用停止条件,则会增加算法复杂度 基于概率图的算法思想是根据业务需求将实体匹配任务建模成概率模型 概率模型[12, 54, 62-63, 58]、主題模型[55, 68-71]、条件随机场、马尔科夫随机场[29]等 根据概率分布、文本内在的主题信息、关联信息等挖掘潜在信息,考虑实体之间的联系,效果比较好; 可利用隐空间信息,这在没有交叉文本的匹配任务上占有非常大的优势 模型建立难度大;算法复杂性较髙 -
[1] WU X, ZHU X, WU G Q, et al. Data mining with big data[J]. IEEE transactions on knowledge and data engineering, 2014, 26(1):97-107. doi: 10.1109/TKDE.2013.109 [2] 赵国栋.大数据时代的历史机遇:产业变革与数据科学[M].北京:清华大学出版社, 2013. [3] GU B, LI Z, ZHANG X, et al. The interaction between schema matching and record matching in data integration[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(1):186-199. doi: 10.1109/TKDE.2016.2611577 [4] LI C, JIN L, MEHROTRA S. Supporting efficient record linkage for large data sets using mapping techniques[J]. World Wide Web, 2006, 9(4):557-584. doi: 10.1007/s11280-006-0226-8 [5] WHANG S E, GARCIA-MOLINA H. Incremental entity resolution on rules and data[J]. The VLDB Journal, 2014, 23(1):77-102. doi: 10.1007/s00778-013-0315-0 [6] GETOOR L, MACHANAVAJJHALA A. Entity resolution:theory, practice & open challenges[J]. Proceedings of the VLDB Endowment, 2012, 5(12):2018-2019. doi: 10.14778/2367502 [7] TEJADA S, KNOBLOCK C A, Minton S. Learning Object Identification Rules for Information Integration[J]. Information Systems, 2001, 26(8):607-633. doi: 10.1016/S0306-4379(01)00042-4 [8] LEITAO L, CALADO P, HERSCHEL M. Efficient and effective duplicate detection in hierarchical data[J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(5):1028-1041. doi: 10.1109/TKDE.2012.60 [9] DUNN H L. Record linkage[J]. American Journal of Public Health and the Nations Health, 1946, 36(12):1412-1416. doi: 10.2105/AJPH.36.12.1412 [10] LIU S, WANG S, ZHU F, et al. Hydra: Large-scale social identity linkage via heterogeneous behavior modeling[C]//Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014: 51-62. [11] LIU J, LEI K H, LIU J Y, et al. Ranking-based name matching for author disambiguation in bibliographic data[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013: Article No 8. [12] KONG C, GAO M, XU C, et al. Entity matching across multiple heterogeneous data sources[C]//International Conference on Database Systems for Advanced Applications. Cham: Springer, 2016: 133-146. [13] KEJRIWAL M, MIRANKER D P. Semi-supervised instance matching using boosted classifiers[C]//European Semantic Web Conference. Cham: Springer, 2015: 388-402. [14] KONDA P, DAS S, SUGANTHAN GC P, et al. Magellan:Toward building entity matching management systems[J]. Proceedings of the VLDB Endowment, 2016, 12(9):1197-1208. http://d.old.wanfangdata.com.cn/NSTLQK/NSTL_QKJJ0220798348/ [15] WHANG S E, MENESTRINA D, KOUTRIKA G, et al. Entity resolution with iterative blocking[C]//Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009: 219-232. [16] KARAPIPERIS D, VERYKIOS V S. An LSH-based blocking approach with a homomorphic matching technique for privacy-preserving record linkage[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(4):909-921. doi: 10.1109/TKDE.2014.2349916 [17] GRUENHEID A, DONG X L, SRIVASTAVA D. Incremental record linkage[J]. Proceedings of the VLDB Endowment, 2014, 9(7):697-708. http://d.old.wanfangdata.com.cn/NSTLHY/NSTL_HYCC0215029003/ [18] KIM H, LEE D. HARRA: Fast iterative hashed record linkage for large-scale data collections[C]//Proceedings of the 13th International Conference on Extending Database Technology. ACM, 2010: 525-536. [19] ZHANG Y, TANG J, YANG Z, et al. Cosnet: Connecting heterogeneous social networks with local and global consistency[C]//Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015: 1485-1494. [20] CHRISTEN P. A survey of indexing techniques for scalable record linkage and deduplication[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(9):1537-1555. doi: 10.1109/TKDE.2011.127 [21] WANG J, LI G, YU J X, et al. Entity matching:How similar is similar[J]. Proceedings of the VLDB Endowment, 2011, 10(4):622-633. http://d.old.wanfangdata.com.cn/OAPaper/oai_pubmedcentral.nih.gov_3606868 [22] SONG Y, KIMURA T, BATJARGAL B, et al. Cross-language record linkage using word embedding driven metadata similarity measurement[C/OL]//International Semantic Web Conference (Posters & Demos). (2016-09-28)[2018-06-01]. http://ceur-ws.org/vol-1690/paper90.pdf. [23] YANG K H, PENG H T, JIANG J Y, et al. Author name disambiguation for citations using topic and web correlation[C]//International Conference on Theory and Practice of Digital Libraries. Berlin: Springer, 2008: 185-196. [24] VAN Der LOO M P J. The stringdist package for approximate string matching[J]. The R Journal, 2014, 6(1):111-122. http://www.researchgate.net/publication/287728108_The_stringdist_package_for_approximate_string_matching [25] LI C L, SU Y C, LIN T W, et al. Combination of feature engineering and ranking models for paper-author identification in KDD Cup 2013[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013: Article No 2. [26] PELED O, FIRE M, ROKACH L, et al. Entity matching in online social networks[C]//Social Computing (SocialCom), 2013 International Conference on. IEEE, 2013: 339-344. [27] LI J, LIANG X, DING W, et al. Feature engineering and tree modeling for author-paper identification challenge[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013: Article No 5. [28] PELED O, FIRE M, ROKACH L, et al. Matching entities across online social networks[J]. Neurocomputing, 2016, 210(C):91-106. http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=Arxiv000001089484 [29] TANG J, FONG A C M, WANG B, et al. A unified probabilistic framework for name disambiguation in digital library[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(6):975-987. doi: 10.1109/TKDE.2011.13 [30] KOUDAS N, SARAWAGI S, SRIVASTAVA D. Record linkage: similarity measures and algorithms[C]//Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. ACM, 2006: 802-803. [31] VESDAPUNT N, GARCIA-MOLINA H. Identifying users in social networks with limited information[C]//Data Engineering (ICDE), 2015 IEEE 31st International Conference on. IEEE, 2015: 627-638. [32] LEE J Y, HUSSAIN R, RIVERA V, et al. Second-level degree-based entity resolution in online social networks[J/OL]. Social Network Analysis and Mining, 2018, 8: 19. (2018-03-16)[2018-06-01]. https://link.springer.com/content/pdf/10.1007%2Fs13278-018-0499-9.pdf. [33] ZHOU X, LIANG X, ZHANG H, et al. Cross-platform identification of anonymous identical users in multiple social media networks[J]. IEEE Transactions on Knowledge and Data Engineering, 2016, 28(2):411-424. doi: 10.1109/TKDE.2015.2485222 [34] ZHANG J, YU P S, ZHOU Z H. Meta-path based multi-network collective link prediction[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014: 1286-1295. [35] LEVIN F H, HEUSER C A. Evaluating the use of social networks in author name disambiguation in digital libraries[J]. Journal of Information and Data Management, 2010(2):183-197. http://www.researchgate.net/publication/220541512_Evaluating_the_Use_of_Social_Networks_in_Author_Name_Disambiguation_in_Digital_Libraries [36] POOJA K M, MONDAL S, CHANDRA J. An unsupervised heuristic based approach for author name disambiguation[C]//Communication Systems and Networks (COMSNETS), 201810th International Conference on. IEEE, 2018: 540-542. [37] ZHONG E, LI L, WANG N, et al. Contextual rule-based feature engineering for author-paper identification[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013: Article No 6. [38] PEROZZI B, AL-RFOU r, SKIENA S. Deepwalk: Online learning of social representations[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014: 701-710. [39] GROVER A, LESKOVEC J. node2vec: Scalable feature learning for networks[C]//Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2016: 855-864. [40] TANG J, QU M, WANG M, et al. Line: Large-scale information network embedding[C]//Proceedings of the 24th International Conference on World Wide Web, International World Wide Web, Conferences Steering Committee, 2015: 1067-1077. [41] ZHOU X, LIANG X, DU X, et al. Structure based user identification across social networks[J]. IEEE Transactions on Knowledge and Data Engineering, 2018, 30(6):1178-1191. doi: 10.1109/TKDE.2017.2784430 [42] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[J]. arXiv preprint, arXiv: 1301.3781v3[cs.CV] 7 Sep 2013. [43] LIU S, WANG S, ZHU F. Structured learning from heterogeneous behavior for social identity linkage[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(7):2005-2019. doi: 10.1109/TKDE.2015.2397434 [44] DEY D. Entity matching in heterogeneous databases:A logistic regression approach[J]. Decision Support Systems, 2008, 44(3):740-747. doi: 10.1016/j.dss.2007.10.007 [45] ZAFARANI R, LIU H. Connecting users across social media sites: a behavioral-modeling approach[C]//Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013: 41-49. [46] ZHOU Y, HOWROYD J, DANICIC S, et al. Extending naive bayes classifier with hierarchy feature level information for record linkage[C]//Workshop on Advanced Methodologies for Bayesian Networks. Cham: Springer, 2015: 93-104. [47] LI L, LI J, GAO H. Rule-based method for entity resolution[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(1):250-263. doi: 10.1109/TKDE.2014.2320713 [48] CHIN W S, ZHUANG Y, JUAN Y C, et al. Effective string processing and matching for author disambiguation[J]. The Journal of Machine Learning Research, 2014, 15(1):3037-3064. http://dl.acm.org/citation.cfm?id=2517295 [49] JIANG Y, LIN C, MENG W, et al. Rule-based deduplication of article records from bibliographic databases[J]. Database, 2014:Article ID bat086. DOI: 10.1093/database/bat086. [50] SETOGUCHI S, ZHU Y, JALBERT J J, et al. Validity of deterministic record linkage using multiple indirect personal identifiers:linking a large registry to claims data[J]. Circulation:Cardiovascular Quality and Outcomes, 2014, 7(3):475-480. doi: 10.1161/CIRCOUTCOMES.113.000294 [51] EKTEFA M, JABAR M A, Sidi F, et al. A threshold-based similarity measure for duplicate detection[C]//Open Systems (ICOS), 2011 IEEE Conference on. IEEE, 2011: 37-41. [52] MACQUEEN J. Some methods for classification and analysis of multivariate observations[C]//Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. 1967: 281-297. [53] MONIKA S, ANAND C, GNANAMURTHY R K. Analyzing the User Profile Linkage across Different Social Network Platforms[J]. International Journal of Computer Science and Engineering, 2016, 4(2):1378-1383. doi: 10.1080/09544828.2015.1126702 [54] CEN L, DRAGUT E C, SI L, et al. Author disambiguation by hierarchical agglomerative clustering with adaptive stopping criterion[C]//Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2013: 741-744. [55] SONG Y, HUANG J, COUNCILL I G, et al. Efficient topic-based unsupervised name disambiguation[C]//Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, 2007: 342-351. [56] QIAN Y, HU Y, CUI J, et al. Combining machine learning and human judgment in author disambiguation[C]//Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 2011: 1241-1246. [57] FELLEGI I P, SUNTER A B. A theory for record linkage[J]. Journal of the American Statistical Association, 1969, 64(328):1183-1210. doi: 10.1080/01621459.1969.10501049 [58] WINKLER W E. Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage[C]//Proceedings of the Section on Survey Research Methods, American Statistical Association. 1988: 667-671. [59] BOYD S, VANDENBERGHE L. Convex Optimization[M]. Cambridge:Cambridge University Press, 2004. [60] GAO M, LIM E P, LO D, et al. CNL: Collective network linkage across heterogeneous social platforms[C]//IEEE International Conference on Data Mining. IEEE Computer Society, 2015: 757-762. [61] YAKOUT M, ELMAGARMID A K, Elmeleegy H, et al. Behavior based record linkage[J]. Proceedings of the VLDB Endowment, 2010, 3(1/2):439-448. http://d.old.wanfangdata.com.cn/NSTLHY/NSTL_HYCC0211524541/ [62] FANG Y, CHANG M W. Entity linking on microblogs with spatial and temporal signals[J]. Transactions of the Association for Computational Linguistics, 2014, 2(1):259-272. http://www.researchgate.net/publication/280313453_Entity_Linking_on_Microblogs_with_Spatial_and_Temporal_Signals [63] HAN H, XU W, ZHA H, et al. A hierarchical naive Bayes mixture model for name disambiguation in author citations[C]//Proceedings of the 2005 ACM Symposium on Applied Computing. ACM, 2005: 1065-1069. [64] SINGLA P, DOMINGOS P. Collective object identification[C]//International Joint Conference on Artificial Intelligence.[S.l.]. Morgan Kaufmann Publishers Inc, 2005: 1636-1637. [65] BALAJI J, MIN C, JAVED F, et al. Avatar: Large scale entity resolution of heterogeneous user profiles[C]//Proceedings of the 2nd Workshop on Data Management for End-To-End Machine Learning. ACM, 2018: Article No 2. [66] BLEI D M, NG A Y, JORDAN M I. Latent dirichlet allocation[J]. Journal of machine Learning Research, 2003(3):993-1022. http://d.old.wanfangdata.com.cn/Periodical/jsjyy201306024 [67] ZHAO W X, JIANG J, WENG J, et al. Comparing twitter and traditional media using topic models[C]//European Conference on Information Retrieval. Berlin: Springer, 2011: 338-349. [68] YANG Y, SUN Y, TANG J, et al. Entity matching across heterogeneous sources[C]//Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015: 1395-1404. [69] BHATTACHARYA I, GETOOR L. A latent dirichlet model for unsupervised entity resolution[C]//Proceedings of the 2006 SIAM International Conference on Data Mining, [S.l.]: Society for Industrial and Applied Mathematics, 2006: 47-58. [70] SEN P. Collective context-aware topic models for entity disambiguation[C]//Proceedings of the 21st International Conference on World Wide Web. ACM, 2012: 729-738. [71] TANG J, FANG Z, SUN J. Incorporating social context and domain knowledge for entity recognition[C]//Proceedings of the 24th International Conference on World Wide Web.[S.l.]: International World Wide Web Conferences Steering Committee, 2015: 517-526. [72] STEORTS R C, HALL R, FIENBERG S E. A bayesian approach to graphical record linkage and deduplication[J]. Journal of the American Statistical Association, 2016, 516(111):1660-1672. doi: 10.1080/01621459.2015.1105807 [73] MAURYA A, TELANG R. Bayesian multi-view models for member-job matching and personalized skill recommendations[C]//Big Data, 2017 IEEE International Conference on. IEEE, 2017: 1193-1202.