Approaches for semantic textual similarity
-
摘要: 综述了语义文本相似度计算的最新研究进展, 主要包括基于字符串、基于统计、基于知识库和基于深度学习的方法. 针对每一类方法, 不仅介绍了其中典型的模型和方法, 而且深入探讨了各类方法的优缺点; 并对该领域的常用公开数据集和评估指标进行了整理, 最后讨论并总结了该领域未来可能的研究方向.Abstract: This paper summarizes the latest research progress on semantic textual similarity calculation methods, including string-based, statistics-based, knowledge-based, and deep-learning-based methods. For each method, the paper reviews not only typical models and approaches, but also discusses the respective advantages and disadvantages of each routine; the paper also explores public datasets and evaluation metrics commonly used. Finally, we put forward several possible directions for future research in the field of semantic textual similarity.
-
Key words:
- textual similarity /
- semantic similarity /
- natural language processing /
- knowledge base /
- deep learning
-
表 1 基于字符串的语义文本相似度计算方法
Tab. 1 String-based method for semantic textual similarity
类型 计算方法 基本思想 基于字符串 编辑距离[11, 16] 文本 ${ {S} }_{ {A} } $ 转换到文本${ {S} }_{ {B} } $ 所需的最少编辑操作次数. 编辑操作包括: 增、删、改.LCS[12] $\dfrac{ {2{{K} } } }{ { { {{L} }_{{A} } } + { {{L} }_{{B} } } } }$ 其中 K 表示 ${{S} }_{{A} }$ 与${{S} }_{{B} }$ 的最长公共子序列的长度(可用动态规划求解).
${{L} }_{{A} }$ 与${ {L} }_{ {B} }$ 分别表示${{S} }_{{A} }$ 与$ {{S}}_{{B}} $ 的长度.N-Gram[13] $\dfrac{ {{N} }_{1} }{ {{N} }_{2} }$ 其中 ${{N} }_{1}$ 表示文本$ {{S}}_{{A}} $ 与$ {{S}}_{{B}} $ 共有的 N 元组数量,$ {{N}}_{2} $ 表示总的 N 元组数量.Jaccard[15] $\dfrac{ {{A} }\bigcap {{B} } }{ {{A} }\bigcup {{B} } }$ 其中 A 与 B 分别为表征 $ {{S}}_{{A}} $ 与$ {{S}}_{{B}} $ 的集合, 集合中的元素可以是字符、单词、N 元组等Dice系数 $\dfrac{2\left|{{A} }\bigcap {{B} }\right|}{\left|{{A} }\right|+\left|{{B} }\right|}$ 其中 A 与 B 分别表示文本 $ {{S}}_{{A}} $ 与$ {{S}}_{{B}} $ 的子集合, 分子表示 A 与 B 交集个数的两倍,
分母为 A 与 B 集合中包含的元素个数之和表 2 基于WordNet和同义词词林的计算方法
Tab. 2 Methods based on WordNet and the synonymy thesaurus
分类 相关方法 特点 基于距离 Rada等[32]、Richardon等[33]、Leacock等[34]、
Wu等[35]、Hirst等[36]、Yang等[37]利用概念结构树计算概念间距离, 并结合深度、
层次信息计算语义文本相似度基于信息量 Reshik[38]、Jiang等[39]、Lin等[40] 通过概念包含的信息量进行语义文本相似度计算,
信息量如何定义是该类方法改进的本质基于属性 Lesk[41]、Banerjee等[42]、Pedersen等[43] 利用概念的释义信息及类型信息计算语义文本相似度 混合式 Li等[44]、Bin等[45]、郑志蕴等[46] 结合上述三类方法, 一般该类方法的算法复杂度较高 表 3 基于网络知识的计算方法
Tab. 3 Methods based on network knowledge
表 4 基于无监督学习的方法
Tab. 4 Methods based on unsupervised learning
表 5 监督方法模型简要总结
Tab. 5 Summary of models based on supervised learning
Type Models Sentence encoder layer Interaction layer Similarity measure layer Siamese Network DSSM[70] MLP - Cosine Similarity CNN-DSSM[71] CNN+MLP - Cosine Similarity LSTM-DSSM[72] LSTM - Cosine Similarity Siamese-LSTM[73] LSTM - Manhattan distance+ SVM Lin等[74] BiLSTM+ self-attention - dot product+ MLP Pontes等[75] CNN+LSTM - Manhattan distance InferSent[76] GRU、LSTM、BiGRU、BiLSTM等 - Maxpooling+ MLP Yang等[77] DAN、Transformer - MLP USE[78] Transformer - MLP Interaction-based ABCNN[79] CNN Attention Pooling+ MLP PWIM[80] BiLSTM cosine similarity + Euclidean distance +dot product CNN+ MLP BiMPM[81] BiLSTM matching MLP DIIN[82] self-attention dot product DenseNet DRCN[83] BiLSTM+DenseNet[84] vector addition+ vector subtraction+ vector modulo MLP 表 6 语义文本相似度任务常用公开数据集
Tab. 6 Common datasets for semantic textual similarity
数据集名称 句子对数量(训练、测试) 数据来源 STS2012 5150(3150, 2000) 现存释义集、新闻、视频描述、机器翻译评估集 STS2013 3750(2000, 1750) 新闻、机器翻译评估集 STS2014 14974(7592, 7382) 新闻、论坛、Twitter STS2015 14342(11342, 3000) 新闻、论坛、图像描述 STS2016 1884(1000, 884) 新闻、机器翻译评估集、论坛 STS2017 3210(2210, 1000) 新闻、机器翻译评估集、论坛 TwitterPPDB 51524(42200, 9324) Twitter MSRVID 1500(750, 750) 视频描述 SICK 9927(5000, 4927) 视频描述、图像描述 -
[1] BLOEHDORN S, BASILI R, CAMMISA M, et al. Semantic kernels for text classification based on topological measures of feature similarity [C]//Proceeding of the Sixth International Conference on Data Mining (ICDM’06). 2006: 808-812. [2] TONG Y, GU L. A news text clustering method based on similarity of text labels [C]//International Conference on Advanced Hybrid Information Processing. 2018: 496-503. [3] ATTARDI G, SIMI M, DEI R S. TANL-1: Coreference resolution by parse analysis and similarity clustering [C]//Proceedings of the 5th International Workshop on Semantic Evaluation. 2010: 108-111. [4] DAS A, MANDAL J, DANIAL Z, et al. A novel approach for automatic bengali question answering system using semantic similarity analysis[EB/OL]. (2019-10-23)[2020-07-01]. https://arxiv.org/ftp/arxiv/papers/1910/1910.10758.pdf. [5] AMIR S, TANASESCU A, ZIGHED D A. Sentence similarity based on semantic kernels for intelligent text retrieval [J]. Journal of Intelligent Information Systems, 2017, 48(3): 675-689. [6] SOORI H, PRILEPOK M, PLATOS J, et al. Semantic and similarity measure methods for plagiarism detection of students’ assignments [C]//Proceedings of the Second International Afro-European Conference for Industrial Advancement AECIA 2015. 2016: 117-125. [7] VADAPALLI R, KURISINKEL L J, GUPTA M, et al. SSAS: Semantic similarity for abstractive summarization [C]//Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 2017: 198-203. [8] QIAN M, LIU J, LI C, et al. A comparative study of English-Chinese translations of court texts by machine and human translators and the Word2Vec based similarity measure’s ability to gauge human evaluation biases [C]//Proceedings of Machine Translation Summit XVII Volume 2: Translator, Project and User Tracks. 2019: 95-100. [9] MAJUMDER G, PAKRAY P, GELBUKH A, et al. Semantic textual similarity methods, tools, and applications: A survey [J]. Computación y Sistemas, 2016, 20(4): 647-665. [10] 王春柳, 杨永辉, 邓霏, 等. 文本相似度计算方法研究综述 [J]. 情报科学, 2019, 37(3): 158-168. [11] RISTAD, ERIC S, YIANILOS, et al. Learning string-edit distance[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998, 20(5): 522-532. [12] XU X, CHEN L, HE P. Fast sequence similarity computing with LCS on LARPBS [C]//International Symposium on Parallel and Distributed Processing and Applications. 2005: 168-175. [13] KONDRAK G. N-gram similarity and distance [C]// String Processing and Information Retrieval. 2005: 115-126. [14] NIWATTANAKUL S, SINGTHONGCHAI J, NAENUDORN E, et al. Using of Jaccard Coefficient for Keywords Similarity [J]. Lecture Notes in Engineering and Computer Science, 2013, 1(3): 13-15. [15] 车万翔, 刘挺, 秦兵, 等. 基于改进编辑距离的中文相似句子检索 [J]. 高技术通讯, 2004, 14(7): 15-19. [16] SLANEY M, CASEY M. Locality-sensitive hashing for finding nearest neighbors [J]. IEEE Signal processing magazine, 2008, 25(2): 128-131. [17] SALTON G, WONG A, YANG C S, et al. A vector space model for automatic indexing [J]. Communications of The ACM, 1975, 18(11): 613-620. [18] LANDAUER T K, DUMAIS S T. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge [J]. Psychological Review, 1997, 104(2): 211-240. [19] HOFMANN T. Probabilistic latent semantic analysis [J]. Uncertainty in Artificial Intelligence, 1999, 15(6): 289-296. [20] BLEI D M, NG A Y, JORDAN M I, et al. Latent dirichlet allocation [J]. Journal of Machine Learning Research, 2012(3): 993-1022. [21] GUO Q L, LI Y M, TANG Q. Similarity computing of documents based on VSM [J]. Application Research of Computers, 2008, 25(11): 3256-3258. [22] LI L. Research and implementation of an improved VSM-based text similarity algorithm [J]. Computer Applications and Software, 2012, 29(2): 282-284. [23] TASI C, HUANG Y, LIU C, et al. Applying VSM and LCS to develop an integrated text retrieval mechanism [J]. Expert Systems With Applications, 2012, 39(4): 3974-3982. [24] 王振振, 何明, 杜永萍. 基于LDA主题模型的文本相似度计算 [J]. 计算机科学, 2013, 40(12): 229-232. [25] XIONG D P, WANG J, LIN H F. An LDA-based approach to finding similar questions for community question answer [J]. Journal of Chinese Information Processing, 2012, 26(5): 40-45. [26] ZHANG C, CHEN L, LI X, et al. Chinese text similarity algorithm based on PST_LDA [J]. Application Research of Computers, 2016, 33(2): 375-377. [27] MIAO Y, YU L, BLUNSOM P, et al. Neural variational inference for text processing [EB/OL]. (2016-01-04)[2020-07-01]. https://arxiv.org/pdf/1511.06038.pdf. [28] LAU J H, BALDWIN T, COHN T, et al. Topically Driven Neural Language Model [C]// Meeting of the Association for Computational Linguistics. 2017: 355-365. [29] MILLER, GEORGE A. WordNet: A lexical database for English [J]. Communications of the Acm, 1995, 38(11): 39-41. [30] 梅家驹, 竺一鸣, 高蕴琦, 等. 同义词词林 [M]. 上海: 上海辞书出版社, 1983. [31] 董振东. 语义关系的表达和知识系统的建造 [J]. 语言文字应用, 1998(3): 76-82. [32] RADA R, MILI H, BICKNELL E J, et al. Development and application of a metric on semantic nets [J]. IEEE Transaction on System Man & Cybernetics, 1989, 19(1):17-30. [33] RICHARDSON R, SMEATON A F. Using WordNet in a knowledge-based approach to information retrieval [EB/OL]. (1995-02-01)[2020-07-01].http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=0DDA60E11D37A7DA2777BF162C86760F?doi=10.1.1.48.9324&rep=rep1&type=pdf. [34] LEACOCK C, CHODOROW M. Combining local context and WordNet similarity for word sense identification [M]// FELLBAUM C. WordNet: An Electronic Lexical Database. Massachusetts: MIT Press, 1998. [35] WU Z B. Verb semantics and lexical selection[C]// Acl Proceedings of Annual Meeting on Association for Computational Linguistics. 1994: 133-138. [36] HIRST G, STONGE D. Lexical chains as representations of context for the detection and correction of malapropisms[M]// FELLBAUM C. WordNet: An Electronic Lexical Database. Massachusetts: MIT Press, 1998, 305: 305-332. [37] YANG D, POWERS D M W. Measuring semantic similarity in the taxonomy of WordNet [C]// ACSC’05: Proceedings of the Twenty-eighth Australasian conference on Computer Science. 2005, 38: 315-322. [38] RESNIK P. Using information content to evaluate semantic similarity in a taxonomy [C]// IJCAI’95: Proceedings of the 14th International Joint Conference on Artificial Intelligence. 1995(1): 448-453. [39] JIANG J J, CONRATH D W. Semantic similarity based on corpus statistics and lexical taxonomy [EB/OL]. (1997-10-01)[2020-07-01]. https://arxiv.org/pdf/cmp-lg/9709008.pdf. [40] LIN D. An information-theoretic definition of similarity [C]//ICML’98: Proceedings of the Fifteenth International Conference on Machine Learning. 1998(7): 296-304. [41] LESK M. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone [C]//Proceedings of the 5th Annual International Conference on Systems Documentation. 1986: 24-26. [42] BANERJEE S, PEDERSEN T. An adapted lesk algorithm for word sense disambiguation using WordNet [C]//International Conference on Intelligent Text Processing and Computational Linguistics. 2002: 136-145. [43] PEDERSEN T, PATWARDHAN S, MICHELIZZI J. WordNet : Similarity-Measuring the relatedness of concepts [C]//Demonstrations’04: Demonstration Papers at HLT-NAACL 2004. 2004(5): 38-41. [44] LI Y, BANDAR Z A, MCLEAN D. An approach for measuring semantic similarity between words using multiple information sources [J]. IEEE Transactions on Knowledge and Data Engineering, 2003, 15(4): 871-882. [45] SHI B, FANG L Y, YAN J Z, et al. Ontology-based measure of semantic similarity between concepts [C]//WCSE '09: Proceedings of the 2009 WRI World Congress on Software Engineering. 2009(2): 109-112. [46] 郑志蕴, 阮春阳, 李伦, 等. 本体语义相似度自适应综合加权算法研究 [J]. 计算机科学, 2016, 43: 242-247. [47] 刘群, 李素建. 基于《知网》 的词汇语义相似度计算 [J]. 中文计算语言学, 2002, 7(2): 59-76. [48] 李峰, 李芳. 中文词语语义相似度计算——基于《知网》2000 [J]. 中文信息学报, 2007, 21(3): 99-105. [49] 江敏, 肖诗斌, 王弘蔚, 等. 一种改进的基于《知网》的词语语义相似度计算 [J]. 中文信息学报, 2008, 22(5): 84-89. [50] STRUBE M, PONZETTO S P. WikiRelate! Computing semantic relatedness using Wikipedia [C]//AAAI'06: Proceedings of the 21st National Conference on Artificial Intelligence. 2006(2): 1419-1424. [51] GABRILOVICH E, MARKOVITCH S. Computing semantic relatedness using wikipedia-based explicit semantic analysis [C]//IJCAI’07: Proceedings of the 20th International Joint Conference on Artifical Intelligence. 2007(1): 1606-1611. [52] WITTEN I, MILNE D N. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links [C]//Proceedings of AAAI’2008. 2008: 25-30. [53] YEH E, RAMAGE D, MANNING C D, et al. WikiWalk: Random walks on Wikipedia for semantic relatedness [C]//Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing. 2009: 41-49. [54] CAMACHO-COLLADOS J, PILEHVAR M T, NAVIGLI R. Nasari: A novel approach to a semantically-aware representation of items [C]//Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2015: 567-577. [55] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space [EB/OL]. (2013-09-07)[2020-07-01]. https://arxiv.org/pdf/1301.3781.pdf. [56] PENNINGTON J, SOCHER R, MANNING C D. Glove: Global vectors for word representation [C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014: 1532-1543. [57] JOULIN A, GRAVE E, BOJANOWSKI P, et al. Bag of tricks for efficient text classification [EB/OL]. (2016-08-09)[2020-07-01]. https://arxiv.org/pdf/1607.01759.pdf. [58] PETERS M E, NEUMANN M, IYYER M, et al. Deep contextualized word representations [EB/OL]. (2018-03-22)[2020-07-01]. https://arxiv.org/pdf/1802.05365.pdf. [59] RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training [EB/OL]. (2018-11-05)[2020-07-01]. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language understanding paper.pdf. [60] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]//NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 5998-6008. [61] DEVLIN J, CHANG M W, LEE K, et al. Bert: Pre-training of deep bidirectional transformers for language understanding [EB/OL]. (2019-05-24)[2020-07-01]. https://arxiv.org/pdf/1810.04805.pdf. [62] LE Q, MIKOLOV T. Distributed representations of sentences and documents [C]//ICML’14: Proceedings of the 31st International Conference on International Conference on Machine Learning. 2014, 32: 1188-1196. [63] PAGLIARDINI M, GUPTA P, JAGGI M. Unsupervised learning of sentence embeddings using compositional n-gram features [EB/OL]. (2018-12-28)[2020-07-01]. https://arxiv.org/pdf/1703.02507.pdf. [64] KIROS R, ZHU Y, SALAKHUTDINOV R R, et al. Skip-thought vectors [C]//Advances in neural information processing systems. 2015: 3294-3302. [65] LOGESWARAN L, LEE H. An efficient framework for learning sentence representations [EB/OL]. (2018-03-07)[2020-07-01]. https://arxiv.org/pdf/1803.02893.pdf. [66] HILL F, CHO K, KORHONEN A. Learning distributed representations of sentences from unlabelled data [EB/OL]. (2016-02-10)[2020-07-01]. https://arxiv.org/pdf/1602.03483.pdf. [67] KUSNER M, SUN Y, KOLKIN N, et al. From word embeddings to document distances [C]//International Conference on Machine Learning. 2015: 957-966. [68] ARORA S, LIANG Y, MA T. A simple but tough-to-beat baseline for sentence embeddings [EB/OL]. (2017-02-04)[2020-07-01]. https://openreview.net/pdf?id=SyK00v5xx. [69] RÜCKLÉ A, EGER S, PEYRARD M, et al. Concatenated power mean word embeddings as universal cross-lingual sentence representations [EB/OL]. (2018-09-12)[2020-07-01]. https://arxiv.org/pdf/1803.01400.pdf. [70] HUANG P S, HE X, GAO J, et al. Learning deep structured semantic models for web search using clickthrough data [C]//Proceedings of the 22nd ACM international conference on Information & Knowledge Management. 2013: 2333-2338. [71] SHEN Y, HE X, GAO J, et al. A latent semantic model with convolutional-pooling structure for information retrieval[C]//Proceedings of the 23rd ACM international conference on conference on information and knowledge management. 2014: 101-110. [72] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. Imagenet classification with deep convolutional neural networks [C]//Advances in nNeural Information Processing Systems. 2012: 1097-1105. [73] PALANGI H, DENG L, SHEN Y, et al. Semantic modelling with long-short-term memory for information retrieval [EB/OL]. (2015-02-27)[2020-07-01]. https://arxiv.org/pdf/1412.6629.pdf. [74] GERS F. Long short-term memory in recurrent neural networks [D]. Lausanne: EPFL, 2001. [75] PONTES E L, HUET S, LINHARES A C, et al. Predicting the semantic textual similarity with siamese CNN and LSTM [EB/OL]. (2018-10-24)[2020-07-01]. https://arxiv.org/pdf/1810.10641.pdf. [76] MUELLER J, THYAGARAJAN A. Siamese recurrent architectures for learning sentence similarity [C]//AAAI’16: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. 2016(2): 2786-2792 . [77] LIN Z, FENG M, SANTOS C N, et al. A structured self-attentive sentence embedding [EB/OL]. (2017-03-09)[2020-07-01]. https://arxiv.org/pdf/1703.03130.pdf. [78] CONNEAU A, KIELA D, SCHWENK H, et al. Supervised learning of universal sentence representations from natural language inference data [EB/OL]. (2017-07-21)[2020-07-01]. https://arxiv.org/pdf/1705.02364v4.pdf. [79] YIN W, SCHÜTZE H, XIANG B, et al. Abcnn: Attention-based convolutional neural network for modeling sentence pairs [J]. Transactions of the Association for Computational Linguistics, 2016(4): 259-272. [80] HE H, LIN J. Pairwise word interaction modeling with deep neural networks for semantic similarity measurement [C]//Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016: 937-948. [81] WANG Z, HAMZA W, FLORIAN R. Bilateral multi-perspective matching for natural language sentences [C]//Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence Main track. 2017: 4144-4150. [82] GONG Y, LUO H, ZHANG J. Natural language inference over interaction space [EB/OL]. (2018-05-26)[2020-07-01]. https://arxiv.org/pdf/1709.04348.pdf. [83] KIM S, KANG I, KWAK N. Semantic sentence matching with densely-connected recurrent and co-attentive information [C]//Proceedings of the AAAI conference on artificial intelligence. 2019, 33: 6586-6593. [84] HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks [C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 4700-4708. [85] YANG Y, YUAN S, CER D, et al. Learning semantic textual similarity from conversations [EB/OL]. (2018-04-20)[2020-07-01]. https://arxiv.org/pdf/1804.07754.pdf. [86] CER D, YANG Y, KONG S, et al. Universal sentence encoder [EB/OL]. (2018-04-12)[2020-07-01]. https://arxiv.org/pdf/1803.11175.pdf. [87] CHEN G, SHI X, CHEN M, et al. Text similarity semantic calculation based on deep reinforcement learning [J]. International Journal of Security and Networks, 2020, 15(1): 59-66.