中国综合性科技类核心期刊(北大核心)

中国科学引文数据库来源期刊(CSCD)

美国《化学文摘》(CA)收录

美国《数学评论》(MR)收录

俄罗斯《文摘杂志》收录

留言板

尊敬的读者、作者、审稿人, 关于本刊的投稿、审稿、编辑和出版的任何问题, 您可以本页添加留言。我们将尽快给您答复。谢谢您的支持!

姓名
邮箱
手机号码
标题
留言内容
验证码

异构网络中实体匹配算法综述

李娜 金冈增 周晓旭 郑建兵 高明

李娜, 金冈增, 周晓旭, 郑建兵, 高明. 异构网络中实体匹配算法综述[J]. 华东师范大学学报(自然科学版), 2018, (5): 41-55. doi: 10.3969/j.issn.1000-5641.2018.05.004
引用本文: 李娜, 金冈增, 周晓旭, 郑建兵, 高明. 异构网络中实体匹配算法综述[J]. 华东师范大学学报(自然科学版), 2018, (5): 41-55. doi: 10.3969/j.issn.1000-5641.2018.05.004
LI Na, JIN Gang-zeng, ZHOU Xiao-xu, ZHENG Jian-bing, GAO Ming. A survey of entity matching algorithms in heterogeneous networks[J]. Journal of East China Normal University (Natural Sciences), 2018, (5): 41-55. doi: 10.3969/j.issn.1000-5641.2018.05.004
Citation: LI Na, JIN Gang-zeng, ZHOU Xiao-xu, ZHENG Jian-bing, GAO Ming. A survey of entity matching algorithms in heterogeneous networks[J]. Journal of East China Normal University (Natural Sciences), 2018, (5): 41-55. doi: 10.3969/j.issn.1000-5641.2018.05.004

异构网络中实体匹配算法综述

doi: 10.3969/j.issn.1000-5641.2018.05.004
基金项目: 

国家重点研发计划项目 2016YFB1000905

国家自然科学基金广东省联合重点项目 U1401256

国家自然科学基金 61672234

国家自然科学基金 61502236

国家自然科学基金 61472321

上海市科技兴农推广项目 T20170303

详细信息
    作者简介:

    李娜, 女, 硕士研究生, 研究方向为数据挖掘.E-mail:nali0606@foxmail.com

    通讯作者:

    郑建兵, 男, 博士, 研究方向为信息处理技术.E-mail:zhengjb@js.chinamobile.com

  • 中图分类号: TP391

A survey of entity matching algorithms in heterogeneous networks

  • 摘要: 互联网、物联网和云计算技术的不断融合,使得各行各业信息化程度越来越高,但同时也带来了数据碎片化的问题.数据碎片化的海量性、异构性、隐私性、相依性和低质性等特征,导致了数据可用性较差,利用这些数据难以挖掘出准确而完整的信息.为了更有效地利用数据,实体匹配、融合和消歧变得尤为重要.主要对异构网络中实体匹配算法进行了综述,对实体相似度度量和数据预处理技术进行了梳理;特别针对海量数据,概述了可扩展实体匹配方法的研究进展,综述了运用监督学习和非监督学习两类技术的实体匹配算法.
  • 图  1  单数据源实体匹配

    Fig.  1  Entity matching within a single data source

    图  2  多数据源实体匹配

    Fig.  2  Entity matching between multiple data sources

    图  3  使用分块技术的候选集生成过程

    Fig.  3  The process of generating pair-wise candidate sets using blocking technology

    图  4  基于监督学习的匹配算法流程图

    Fig.  4  Flowchart for supervised learning-based entity matching algorithms

    表  1  数据预处理类型、方法及优缺点

    Tab.  1  Types, methods, advantages, and disadvantages of data processing

    数据
    类型
    处理方法优点缺点
    缺失
    数据
    含有缺失值的数据记录直接删除简单方便会去除很多有用的数据,包括很多可能匹配的数据
    数值型元素按该列平均值补全,字符型用null补全数据质量较髙,应用范围最广损失了一部分精度
    聚类方法,基于其他特性判断该样本的整体情况,维持其整体情况不变并补全
    比如:一个人在新浪微博和人人网上的好友量都有top3, 且在开心网上的好友数据缺失,那么使用平均值补全其数据是不合理的,因为此人在开心网上好友排名仍然会相对较髙。可以根据这个信息进行缺失值补全[10]
    数据质量最髙,数据内在特性得以保留复杂度较高,效率低
    噪声
    数据
    聚类之后去除离群点思想简单,易操作会去除很多有用的数据
    根据特定规则,基于统计,将少数出现的数据变成多数
    比如:Modiano不是一些特殊数据且出现多次,Modianoy只出现一次,极有可能Modianoy是拼错的,将Modianoy换成Modiano[11]
    针对特定属性进行特别处理,得到数据质量较髙不同的属性可能有不同的规则,比如姓氏Lin出现多次,Ling出现一次,但是不可能将Ling更改为Lin,所以替换场景比较难确定
    下载: 导出CSV

    表  2  参考文献数据库

    Tab.  2  Reference database

    记录 标题 作者审稿单位
    b1Object Identification using CRFsLinda StewartProc. AAAI-05
    b2Object Identification using CRFsLinda StewartProc. 20th AAAI
    b3Learning Boolean FormulasBill JohnsonProc. AAAI-05
    b4Learning of Boolean ExpressionsWilliam JohnsonProc. 20th AAAI
    下载: 导出CSV

    表  3  算法优缺点总结

    Tab.  3  Advantages and disadvantages of different algorithms

    类别主要算法思想涉及算法优点缺点
    基于监督学习的实体匹配算法使用数据集训练算法模型,新的实体对直接使用此模型得到匹配结果。支持向量机[42]、logistic回归[44]、AdaBoostf[13]、梯度提升树以[27, 37]准确性相对较髙若想得到较好的结果,需要对分类算法的参数进行训练,需要大量的有标签数据,这在实际中难以满足需求
    基于规则的算法主要通过人为制定匹配规则。文献[45, 47-49, 51]思路简单,比较容易理解人为定义规则,非常消耗人力且定义规则存在主观性;很难达到最优值。
    基于非监督学习的实体匹配算法基于聚类的算法思想是类别内的数据比较相似,类别之间的数据差异性比较大K-Means法[29, 53]、层次聚类算法[54-56]不需要标注数据,人力成本低算法效果一般较差;簇的个数若人为指定,匹配效果很难最优;若采用自动确定方式,需要引用停止条件,则会增加算法复杂度
    基于概率图的算法思想是根据业务需求将实体匹配任务建模成概率模型概率模型[12, 54, 62-63, 58]、主題模型[55, 68-71]、条件随机场、马尔科夫随机场[29]根据概率分布、文本内在的主题信息、关联信息等挖掘潜在信息,考虑实体之间的联系,效果比较好; 可利用隐空间信息,这在没有交叉文本的匹配任务上占有非常大的优势模型建立难度大;算法复杂性较髙
    下载: 导出CSV
  • [1] WU X, ZHU X, WU G Q, et al. Data mining with big data[J]. IEEE transactions on knowledge and data engineering, 2014, 26(1):97-107. doi:  10.1109/TKDE.2013.109
    [2] 赵国栋.大数据时代的历史机遇:产业变革与数据科学[M].北京:清华大学出版社, 2013.
    [3] GU B, LI Z, ZHANG X, et al. The interaction between schema matching and record matching in data integration[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(1):186-199. doi:  10.1109/TKDE.2016.2611577
    [4] LI C, JIN L, MEHROTRA S. Supporting efficient record linkage for large data sets using mapping techniques[J]. World Wide Web, 2006, 9(4):557-584. doi:  10.1007/s11280-006-0226-8
    [5] WHANG S E, GARCIA-MOLINA H. Incremental entity resolution on rules and data[J]. The VLDB Journal, 2014, 23(1):77-102. doi:  10.1007/s00778-013-0315-0
    [6] GETOOR L, MACHANAVAJJHALA A. Entity resolution:theory, practice & open challenges[J]. Proceedings of the VLDB Endowment, 2012, 5(12):2018-2019. doi:  10.14778/2367502
    [7] TEJADA S, KNOBLOCK C A, Minton S. Learning Object Identification Rules for Information Integration[J]. Information Systems, 2001, 26(8):607-633. doi:  10.1016/S0306-4379(01)00042-4
    [8] LEITAO L, CALADO P, HERSCHEL M. Efficient and effective duplicate detection in hierarchical data[J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(5):1028-1041. doi:  10.1109/TKDE.2012.60
    [9] DUNN H L. Record linkage[J]. American Journal of Public Health and the Nations Health, 1946, 36(12):1412-1416. doi:  10.2105/AJPH.36.12.1412
    [10] LIU S, WANG S, ZHU F, et al. Hydra: Large-scale social identity linkage via heterogeneous behavior modeling[C]//Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014: 51-62.
    [11] LIU J, LEI K H, LIU J Y, et al. Ranking-based name matching for author disambiguation in bibliographic data[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013: Article No 8.
    [12] KONG C, GAO M, XU C, et al. Entity matching across multiple heterogeneous data sources[C]//International Conference on Database Systems for Advanced Applications. Cham: Springer, 2016: 133-146.
    [13] KEJRIWAL M, MIRANKER D P. Semi-supervised instance matching using boosted classifiers[C]//European Semantic Web Conference. Cham: Springer, 2015: 388-402.
    [14] KONDA P, DAS S, SUGANTHAN GC P, et al. Magellan:Toward building entity matching management systems[J]. Proceedings of the VLDB Endowment, 2016, 12(9):1197-1208. http://d.old.wanfangdata.com.cn/NSTLQK/NSTL_QKJJ0220798348/
    [15] WHANG S E, MENESTRINA D, KOUTRIKA G, et al. Entity resolution with iterative blocking[C]//Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009: 219-232.
    [16] KARAPIPERIS D, VERYKIOS V S. An LSH-based blocking approach with a homomorphic matching technique for privacy-preserving record linkage[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(4):909-921. doi:  10.1109/TKDE.2014.2349916
    [17] GRUENHEID A, DONG X L, SRIVASTAVA D. Incremental record linkage[J]. Proceedings of the VLDB Endowment, 2014, 9(7):697-708. http://d.old.wanfangdata.com.cn/NSTLHY/NSTL_HYCC0215029003/
    [18] KIM H, LEE D. HARRA: Fast iterative hashed record linkage for large-scale data collections[C]//Proceedings of the 13th International Conference on Extending Database Technology. ACM, 2010: 525-536.
    [19] ZHANG Y, TANG J, YANG Z, et al. Cosnet: Connecting heterogeneous social networks with local and global consistency[C]//Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015: 1485-1494.
    [20] CHRISTEN P. A survey of indexing techniques for scalable record linkage and deduplication[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(9):1537-1555. doi:  10.1109/TKDE.2011.127
    [21] WANG J, LI G, YU J X, et al. Entity matching:How similar is similar[J]. Proceedings of the VLDB Endowment, 2011, 10(4):622-633. http://d.old.wanfangdata.com.cn/OAPaper/oai_pubmedcentral.nih.gov_3606868
    [22] SONG Y, KIMURA T, BATJARGAL B, et al. Cross-language record linkage using word embedding driven metadata similarity measurement[C/OL]//International Semantic Web Conference (Posters & Demos). (2016-09-28)[2018-06-01]. http://ceur-ws.org/vol-1690/paper90.pdf.
    [23] YANG K H, PENG H T, JIANG J Y, et al. Author name disambiguation for citations using topic and web correlation[C]//International Conference on Theory and Practice of Digital Libraries. Berlin: Springer, 2008: 185-196.
    [24] VAN Der LOO M P J. The stringdist package for approximate string matching[J]. The R Journal, 2014, 6(1):111-122. http://www.researchgate.net/publication/287728108_The_stringdist_package_for_approximate_string_matching
    [25] LI C L, SU Y C, LIN T W, et al. Combination of feature engineering and ranking models for paper-author identification in KDD Cup 2013[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013: Article No 2.
    [26] PELED O, FIRE M, ROKACH L, et al. Entity matching in online social networks[C]//Social Computing (SocialCom), 2013 International Conference on. IEEE, 2013: 339-344.
    [27] LI J, LIANG X, DING W, et al. Feature engineering and tree modeling for author-paper identification challenge[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013: Article No 5.
    [28] PELED O, FIRE M, ROKACH L, et al. Matching entities across online social networks[J]. Neurocomputing, 2016, 210(C):91-106. http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=Arxiv000001089484
    [29] TANG J, FONG A C M, WANG B, et al. A unified probabilistic framework for name disambiguation in digital library[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(6):975-987. doi:  10.1109/TKDE.2011.13
    [30] KOUDAS N, SARAWAGI S, SRIVASTAVA D. Record linkage: similarity measures and algorithms[C]//Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. ACM, 2006: 802-803.
    [31] VESDAPUNT N, GARCIA-MOLINA H. Identifying users in social networks with limited information[C]//Data Engineering (ICDE), 2015 IEEE 31st International Conference on. IEEE, 2015: 627-638.
    [32] LEE J Y, HUSSAIN R, RIVERA V, et al. Second-level degree-based entity resolution in online social networks[J/OL]. Social Network Analysis and Mining, 2018, 8: 19. (2018-03-16)[2018-06-01]. https://link.springer.com/content/pdf/10.1007%2Fs13278-018-0499-9.pdf.
    [33] ZHOU X, LIANG X, ZHANG H, et al. Cross-platform identification of anonymous identical users in multiple social media networks[J]. IEEE Transactions on Knowledge and Data Engineering, 2016, 28(2):411-424. doi:  10.1109/TKDE.2015.2485222
    [34] ZHANG J, YU P S, ZHOU Z H. Meta-path based multi-network collective link prediction[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014: 1286-1295.
    [35] LEVIN F H, HEUSER C A. Evaluating the use of social networks in author name disambiguation in digital libraries[J]. Journal of Information and Data Management, 2010(2):183-197. http://www.researchgate.net/publication/220541512_Evaluating_the_Use_of_Social_Networks_in_Author_Name_Disambiguation_in_Digital_Libraries
    [36] POOJA K M, MONDAL S, CHANDRA J. An unsupervised heuristic based approach for author name disambiguation[C]//Communication Systems and Networks (COMSNETS), 201810th International Conference on. IEEE, 2018: 540-542.
    [37] ZHONG E, LI L, WANG N, et al. Contextual rule-based feature engineering for author-paper identification[C]//Proceedings of the 2013 KDD Cup 2013 Workshop. ACM, 2013: Article No 6.
    [38] PEROZZI B, AL-RFOU r, SKIENA S. Deepwalk: Online learning of social representations[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014: 701-710.
    [39] GROVER A, LESKOVEC J. node2vec: Scalable feature learning for networks[C]//Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2016: 855-864.
    [40] TANG J, QU M, WANG M, et al. Line: Large-scale information network embedding[C]//Proceedings of the 24th International Conference on World Wide Web, International World Wide Web, Conferences Steering Committee, 2015: 1067-1077.
    [41] ZHOU X, LIANG X, DU X, et al. Structure based user identification across social networks[J]. IEEE Transactions on Knowledge and Data Engineering, 2018, 30(6):1178-1191. doi:  10.1109/TKDE.2017.2784430
    [42] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[J]. arXiv preprint, arXiv: 1301.3781v3[cs.CV] 7 Sep 2013.
    [43] LIU S, WANG S, ZHU F. Structured learning from heterogeneous behavior for social identity linkage[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(7):2005-2019. doi:  10.1109/TKDE.2015.2397434
    [44] DEY D. Entity matching in heterogeneous databases:A logistic regression approach[J]. Decision Support Systems, 2008, 44(3):740-747. doi:  10.1016/j.dss.2007.10.007
    [45] ZAFARANI R, LIU H. Connecting users across social media sites: a behavioral-modeling approach[C]//Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013: 41-49.
    [46] ZHOU Y, HOWROYD J, DANICIC S, et al. Extending naive bayes classifier with hierarchy feature level information for record linkage[C]//Workshop on Advanced Methodologies for Bayesian Networks. Cham: Springer, 2015: 93-104.
    [47] LI L, LI J, GAO H. Rule-based method for entity resolution[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(1):250-263. doi:  10.1109/TKDE.2014.2320713
    [48] CHIN W S, ZHUANG Y, JUAN Y C, et al. Effective string processing and matching for author disambiguation[J]. The Journal of Machine Learning Research, 2014, 15(1):3037-3064. http://dl.acm.org/citation.cfm?id=2517295
    [49] JIANG Y, LIN C, MENG W, et al. Rule-based deduplication of article records from bibliographic databases[J]. Database, 2014:Article ID bat086. DOI: 10.1093/database/bat086.
    [50] SETOGUCHI S, ZHU Y, JALBERT J J, et al. Validity of deterministic record linkage using multiple indirect personal identifiers:linking a large registry to claims data[J]. Circulation:Cardiovascular Quality and Outcomes, 2014, 7(3):475-480. doi:  10.1161/CIRCOUTCOMES.113.000294
    [51] EKTEFA M, JABAR M A, Sidi F, et al. A threshold-based similarity measure for duplicate detection[C]//Open Systems (ICOS), 2011 IEEE Conference on. IEEE, 2011: 37-41.
    [52] MACQUEEN J. Some methods for classification and analysis of multivariate observations[C]//Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. 1967: 281-297.
    [53] MONIKA S, ANAND C, GNANAMURTHY R K. Analyzing the User Profile Linkage across Different Social Network Platforms[J]. International Journal of Computer Science and Engineering, 2016, 4(2):1378-1383. doi:  10.1080/09544828.2015.1126702
    [54] CEN L, DRAGUT E C, SI L, et al. Author disambiguation by hierarchical agglomerative clustering with adaptive stopping criterion[C]//Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2013: 741-744.
    [55] SONG Y, HUANG J, COUNCILL I G, et al. Efficient topic-based unsupervised name disambiguation[C]//Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, 2007: 342-351.
    [56] QIAN Y, HU Y, CUI J, et al. Combining machine learning and human judgment in author disambiguation[C]//Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 2011: 1241-1246.
    [57] FELLEGI I P, SUNTER A B. A theory for record linkage[J]. Journal of the American Statistical Association, 1969, 64(328):1183-1210. doi:  10.1080/01621459.1969.10501049
    [58] WINKLER W E. Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage[C]//Proceedings of the Section on Survey Research Methods, American Statistical Association. 1988: 667-671.
    [59] BOYD S, VANDENBERGHE L. Convex Optimization[M]. Cambridge:Cambridge University Press, 2004.
    [60] GAO M, LIM E P, LO D, et al. CNL: Collective network linkage across heterogeneous social platforms[C]//IEEE International Conference on Data Mining. IEEE Computer Society, 2015: 757-762.
    [61] YAKOUT M, ELMAGARMID A K, Elmeleegy H, et al. Behavior based record linkage[J]. Proceedings of the VLDB Endowment, 2010, 3(1/2):439-448. http://d.old.wanfangdata.com.cn/NSTLHY/NSTL_HYCC0211524541/
    [62] FANG Y, CHANG M W. Entity linking on microblogs with spatial and temporal signals[J]. Transactions of the Association for Computational Linguistics, 2014, 2(1):259-272. http://www.researchgate.net/publication/280313453_Entity_Linking_on_Microblogs_with_Spatial_and_Temporal_Signals
    [63] HAN H, XU W, ZHA H, et al. A hierarchical naive Bayes mixture model for name disambiguation in author citations[C]//Proceedings of the 2005 ACM Symposium on Applied Computing. ACM, 2005: 1065-1069.
    [64] SINGLA P, DOMINGOS P. Collective object identification[C]//International Joint Conference on Artificial Intelligence.[S.l.]. Morgan Kaufmann Publishers Inc, 2005: 1636-1637.
    [65] BALAJI J, MIN C, JAVED F, et al. Avatar: Large scale entity resolution of heterogeneous user profiles[C]//Proceedings of the 2nd Workshop on Data Management for End-To-End Machine Learning. ACM, 2018: Article No 2.
    [66] BLEI D M, NG A Y, JORDAN M I. Latent dirichlet allocation[J]. Journal of machine Learning Research, 2003(3):993-1022. http://d.old.wanfangdata.com.cn/Periodical/jsjyy201306024
    [67] ZHAO W X, JIANG J, WENG J, et al. Comparing twitter and traditional media using topic models[C]//European Conference on Information Retrieval. Berlin: Springer, 2011: 338-349.
    [68] YANG Y, SUN Y, TANG J, et al. Entity matching across heterogeneous sources[C]//Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015: 1395-1404.
    [69] BHATTACHARYA I, GETOOR L. A latent dirichlet model for unsupervised entity resolution[C]//Proceedings of the 2006 SIAM International Conference on Data Mining, [S.l.]: Society for Industrial and Applied Mathematics, 2006: 47-58.
    [70] SEN P. Collective context-aware topic models for entity disambiguation[C]//Proceedings of the 21st International Conference on World Wide Web. ACM, 2012: 729-738.
    [71] TANG J, FANG Z, SUN J. Incorporating social context and domain knowledge for entity recognition[C]//Proceedings of the 24th International Conference on World Wide Web.[S.l.]: International World Wide Web Conferences Steering Committee, 2015: 517-526.
    [72] STEORTS R C, HALL R, FIENBERG S E. A bayesian approach to graphical record linkage and deduplication[J]. Journal of the American Statistical Association, 2016, 516(111):1660-1672. doi:  10.1080/01621459.2015.1105807
    [73] MAURYA A, TELANG R. Bayesian multi-view models for member-job matching and personalized skill recommendations[C]//Big Data, 2017 IEEE International Conference on. IEEE, 2017: 1193-1202.
  • 加载中
图(4) / 表(3)
计量
  • 文章访问数:  189
  • HTML全文浏览量:  83
  • PDF下载量:  272
  • 被引次数: 0
出版历程
  • 收稿日期:  2018-07-04
  • 刊出日期:  2018-09-25

目录

    /

    返回文章
    返回