

Survey on distributed word embeddings based on neural network language models

YU Ke-ren, FU Yun-bin, DONG Qi-wen

Citation: YU Ke-ren, FU Yun-bin, DONG Qi-wen. Survey on distributed word embeddings based on neural network language models[J]. Journal of East China Normal University (Natural Sciences), 2017, (5): 52-65, 79. doi: 10.3969/j.issn.1000-5641.2017.05.006


doi: 10.3969/j.issn.1000-5641.2017.05.006
Funds: 

National Key Research and Development Program of China 2016YFB1000905

NSFC-Guangdong Joint Key Project U1401256

National Natural Science Foundation of China 61672234

National Natural Science Foundation of China 61402177

More Information
    Author Bio:

    YU Ke-ren, male, master's degree candidate; research interests: natural language processing. E-mail: yu_void@qq.com

    Corresponding author:

    FU Yun-bin, male, postdoctoral researcher; research interests: data science and machine learning. E-mail: fuyunbin2012@163.com

  • Note: A distributed representation is distinct from a distributional representation. "Distributed" emphasizes spreading the description of an object over many dimensions, and the term was coined to contrast with one-hot representations; "distributional" emphasizes grounding in distributional theory, i.e. that the contexts a word co-occurs with shape its meaning. Most distributed word vectors are also distributional, and all distributional representations rest on the distributional hypothesis. (A toy contrast between one-hot and distributed vectors is sketched below this list.)
  • CLC number: TP391
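To make the note above concrete, here is a toy contrast in Python/NumPy between a one-hot representation, where every word occupies its own dimension and all word pairs are orthogonal, and a distributed representation, where meaning is spread over all dimensions. The three-word vocabulary and the dense vector values are invented purely for illustration.

    import numpy as np

    vocab = ["king", "queen", "apple"]
    one_hot = np.eye(len(vocab))        # each word gets its own dimension

    # hypothetical 4-dimensional distributed vectors (values chosen by hand)
    dense = np.array([[0.8, 0.1, 0.7, 0.0],   # king
                      [0.7, 0.2, 0.8, 0.1],   # queen
                      [0.0, 0.9, 0.1, 0.8]])  # apple

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # One-hot vectors are pairwise orthogonal, so every similarity is 0.0
    # and no notion of relatedness can be read off the representation.
    print(cos(one_hot[0], one_hot[1]))
    # Distributed vectors can encode relatedness: "king" scores closer to
    # "queen" than to "apple".
    print(cos(dense[0], dense[1]), cos(dense[0], dense[2]))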


  • Abstract: Word vectorization is one of the important research topics in natural language processing. Its core is to model the words in a text so that each word is represented by a relatively low-dimensional vector. There are many ways to generate word vectors; the best-performing ones at present are the distributed word vectors produced by neural network language models, of which the open-source Word2vec toolkit released by Google in 2013 is one example. Distributed word vectors have been applied to natural language processing tasks such as clustering, named entity recognition, and part-of-speech tagging; their quality depends both on the performance of the underlying neural network language model and on the specific task that language model is built for. This paper surveys neural-network-based distributed word vectors from three aspects: how classic neural network language models are constructed; how the large multi-class classification problem inside these language models is optimized; and how auxiliary structures can be used to train word vectors.
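As a usage-level illustration of the pipeline the abstract describes (training distributed word vectors with a Word2vec-style skip-gram model and then querying them), the following minimal sketch uses the open-source gensim library; it assumes gensim 4.x, and the toy corpus and hyperparameter values are invented and far too small to yield meaningful vectors.

    from gensim.models import Word2Vec

    # Toy corpus; in practice this would be a large tokenized text collection.
    sentences = [
        ["natural", "language", "processing", "with", "word", "vectors"],
        ["neural", "network", "language", "models", "learn", "word", "vectors"],
        ["word", "vectors", "support", "clustering", "and", "entity", "recognition"],
    ]

    # sg=1 selects the skip-gram model and negative=5 enables negative sampling;
    # hs=1 (with negative=0) would switch the output layer to hierarchical softmax.
    model = Word2Vec(
        sentences,
        vector_size=50,   # dimensionality of the word vectors
        window=3,         # context window size
        min_count=1,      # keep every word in this tiny corpus
        sg=1,
        negative=5,
        epochs=50,
        seed=1,
    )

    vec = model.wv["language"]                 # the learned 50-dimensional vector
    print(model.wv.most_similar("language"))   # nearest neighbours by cosine similarity

The sg/negative/hs switches correspond directly to the model families compared in Table 1 and to the output-layer optimizations (hierarchical softmax, negative sampling) that the survey discusses.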
  • Table 1  Comparison of the advantages and disadvantages of the models

    Model: skip-gram+NEG
    Advantages and applicable scenarios: Uses the surrounding words as context and ignores their order, so it can generate a large number of (context, target word) samples and converges quickly. The structure is simple and the number of parameters is small. Best suited to tasks that care only about individual words rather than their usage in context, so it performs well on both word-similarity and analogy tasks.
    Disadvantages: Ignores the order of the context, so it cannot recognize the logical relations the context implies. Limited generality; unsuitable for many NLP tasks that require semantic understanding.

    Model: RNNLM
    Advantages and applicable scenarios: Pays close attention to the order of the context and can capture associations between words that are far apart. Has a large advantage on tasks involving semantic understanding, and performs even better when combined with an attention mechanism [61].
    Disadvantages: Has a great many variants, so model tuning is time-consuming; the variants differ in complexity, and the number of parameters is large.

    Model: C&W
    Advantages and applicable scenarios: Uses concatenated context word vectors and was designed for complex NLP tasks, so it is fairly general. Compared with the RNNLM, the model is less complex and trains faster.
    Disadvantages: Less general than the RNNLM on complex NLP tasks, and weaker than the skip-gram model on word-oriented tasks.
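To complement Table 1, the sketch below spells out the skip-gram + NEG training rule in plain NumPy: every (target, context) pair taken from a window is pushed towards a logistic score of 1, while k randomly drawn words are pushed towards 0. It is a simplification for illustration only (tiny corpus, uniform negative sampling instead of the smoothed unigram distribution, no subsampling), not the word2vec reference implementation.

    import numpy as np

    # Minimal skip-gram with negative sampling (SGNS), simplified for illustration.
    corpus = "the quick brown fox jumps over the lazy dog".split()
    vocab = sorted(set(corpus))
    word2id = {w: i for i, w in enumerate(vocab)}
    V, dim, window, k, lr = len(vocab), 16, 2, 5, 0.05

    rng = np.random.default_rng(0)
    W_in = rng.normal(scale=0.1, size=(V, dim))   # target ("input") vectors
    W_out = rng.normal(scale=0.1, size=(V, dim))  # context ("output") vectors

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for epoch in range(200):
        for pos, word in enumerate(corpus):
            t = word2id[word]
            for off in range(-window, window + 1):
                if off == 0 or not 0 <= pos + off < len(corpus):
                    continue
                c = word2id[corpus[pos + off]]
                # one positive pair plus k uniformly sampled negative words
                pairs = [(c, 1.0)] + [(int(rng.integers(V)), 0.0) for _ in range(k)]
                for j, label in pairs:
                    v_t, v_j = W_in[t].copy(), W_out[j].copy()
                    g = lr * (sigmoid(v_t @ v_j) - label)  # gradient of the logistic loss
                    W_in[t] -= g * v_j
                    W_out[j] -= g * v_t

    # after training, the rows of W_in are the word vectors, e.g. W_in[word2id["fox"]]

Because each window position costs only 1 + k binary decisions instead of a softmax over the whole vocabulary, this is the property behind the table's remark that skip-gram + NEG converges quickly with a simple, lightweight structure.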
  • [1] HARRIS Z S. Distributional structure[J]. Word, 1954, 10(2/3):146-162. doi:  10.1080/00437956.1954.11659520
    [2] FIRTH J R. A synopsis of linguistic theory, 1930-1955[J]. Studies in linguistic analysis, 1957(S):1-31.
    [3] 来斯惟 (LAI S W). Research on word and document semantic vector representation based on neural networks[D]. Beijing: University of Chinese Academy of Sciences, 2016. (in Chinese)
    [4] TURIAN J, RATINOV L, BENGIO Y. Word representations:a simple and general method for semi-supervised learning[C]//ACL 2010, Proceedings of the Meeting of the Association for Computational Linguistics, July 11-16, 2010, Uppsala, Sweden. DBLP, 2010:384-394.
    [5] DEERWESTER S, DUMAIS S T, FURNAS G W, et al. Indexing by latent semantic analysis[J]. Journal of the American Society for Information Science, 1990, 41(6):391. doi:  10.1002/(ISSN)1097-4571
    [6] PENNINGTON J, SOCHER R, MANNING C. Glove:Global vectors for word representation[C]//Conference on Empirical Methods in Natural Language Processing, 2014:1532-1543.
    [7] BROWN P F, DESOUZA P V, MERCER R L, et al. Class-based n-gram models of natural language[J]. Computational linguistics, 1992, 18(4):467-479.
    [8] GUO J, CHE W, WANG H, et al. Revisiting embedding features for simple semi-supervised learning[C]//Conference on Empirical Methods in Natural Language Processing, 2014:110-120.
    [9] CHEN X, XU L, LIU Z, et al. Joint learning of character and word embeddings[C]//International Conference on Artificial Intelligence. AAAI Press, 2015:1236-1242.
    [10] HINTON G E. Learning distributed representations of concepts[C]//Proceedings of the Eighth Annual Conference of the Cognitive Science Society, 1986:12.
    [11] MIIKKULAINEN R, DYER M G. Natural language processing with modular neural networks and distributed lexicon[C]//Cognitive Science, 1991:343-399.
    [12] XU W, RUDNICKY A. Can artificial neural networks learn language models?[C]//International Conference on Spoken Language Processing. DBLP, 2000:202-205.
    [13] BENGIO Y, DUCHARME R, VINCENT P, et al. A neural probabilistic language model[J]. Journal of Machine Learning Research, 2003, 3(6):1137-1155.
    [14] MNIH A, HINTON G. Three new graphical models for statistical language modelling[C]//Machine Learning, Proceedings of the Twenty-Fourth International Conference. DBLP, 2007:641-648.
    [15] SUTSKEVER I, HINTON G E. Learning multilevel distributed representations for high-dimensional sequences[J]. Journal of Machine Learning Research, 2007(2):548-555.
    [16] MNIH A, HINTON G. A scalable hierarchical distributed language model[C]//Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December. DBLP, 2008:1081-1088.
    [17] MNIH A, KAVUKCUOGLU K. Learning word embeddings efficiently with noise-contrastive estimation[C]//Advances in Neural Information Processing Systems, 2013:2265-2273.
    [18] MIKOLOV T, KARAFIÁT M, BURGET L, et al. Recurrent neural network based language model[C]//INTERSPEECH 2010, Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September. DBLP, 2010:1045-1048.
    [19] MIKOLOV T, KOMBRINK S, DEORAS A, et al. Rnnlm-recurrent neural network language modeling toolkit[C]//Processingof the 2011 ASRU Workshop, 2011:196-201.
    [20] BENGIO Y, SIMARD P, FRASCONI P. Learning long-term dependencies with gradient descent is difficult[J]. IEEE Transactions on Neural Networks, 1994, 5(2):157-166.
    [21] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural computation, 1997, 9(8):1735-1780. doi:  10.1162/neco.1997.9.8.1735
    [22] CHO K, VAN MERRIENBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoderdecoder for statistical machine translation[C]//Empirical Methods in Natural Language Processing, 2014:1724-1734.
    [23] CHO K, VAN MERRIËNBOER B, BAHDANAU D, et al. On the properties of neural machine translation:Encoder-decoder approaches[J]. ArXiv preprint arXiv:1409.1259, 2014.
    [24] CHUNG J, GULCEHRE C, CHO K H, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling[J]. ArXiv preprint arXiv:1412.3555, 2014.
    [25] GREFF K, SRIVASTAVA R K, KOUTNÍK J, et al. LSTM:A search space odyssey[J]. IEEE Transactions on Neural Networks & Learning Systems, 2015(99):1-11.
    [26] JOZEFOWICZ R, ZAREMBA W, SUTSKEVER I, et al. An empirical exploration of recurrent network architectures[C]//International Conference on Machine Learning, 2015:2342-2350.
    [27] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[J]. ArXiv preprint arXiv:1301.3781, 2013.
    [28] MORIN F, BENGIO Y. Hierarchical probabilistic neural network language model[C]//Aistats, 2005:246-252.
    [29] GOODMAN J. Classes for fast maximum entropy training[C]//IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 2001:561-564.
    [30] FELLBAUM C, MILLER G. WordNet:An Electronic Lexical Database[M].Cambridge, MA:MIT Press, 1998.
    [31] MNIH A, HINTON G. A scalable hierarchical distributed language model[C]//International Conference on Neural Information Processing Systems. Curran Associates Inc, 2008:1081-1088.
    [32] LE H S, OPARIN I, ALLAUZEN A, et al. Structured Output Layer neural network language model[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2011:5524-5527.
    [33] MIKOLOV T, KOMBRINK S, BURGET L, et al. Extensions of recurrent neural network language model[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2011:5528-5531.
    [34] COLLOBERT R, WESTON J. A unified architecture for natural language processing:Deep neural networks with multitask learning[C]//International Conference. DBLP, 2008:160-167.
    [35] COLLOBERT R, WESTON J, BOTTOU L, et al. Natural language processing (almost) from scratch[J]. Journal of Machine Learning Research, 2011, 12(1):2493-2537.
    [36] GUTMANN M, HYVÄRINEN A. Noise-contrastive estimation:A new estimation principle for unnormalized statistical models[J]. Journal of Machine Learning Research, 2010(9):297-304.
    [37] GUTMANN M U, HYVARINEN A. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics[J]. Journal of Machine Learning Research, 2012, 13(1):307-361.
    [38] MNIH A, TEH Y W. A fast and simple algorithm for training neural probabilistic language models[C]//International Conference on Machine Learning, 2012:1751-1758.
    [39] BENGIO Y, SENÉCAL J S. Quick Training of Probabilistic Neural Nets by Importance Sampling[C]//AISTATS, 2003:1-9.
    [40] ZOPH B, VASWANI A, MAY J, et al. Simple, Fast Noise-Contrastive Estimation for Large RNN Vocabularies[C]//Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies, 2016:1217-1222.
    [41] DYER C. Notes on noise contrastive estimation and negative sampling[J]. ArXiv preprint arXiv:1410.8251, 2014.
    [42] MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed Representations of Words and Phrases and their Compositionality[J]. Advances in Neural Information Processing Systems, 2013, 26:3111-3119.
    [43] CHEN W, GRANGIER D, AULI M, et al. Strategies for training large vocabulary neural language models[C]//Meeting of the Association for Computational Linguistics, 2015:1975-1985.
    [44] DEVLIN J, ZBIB R, HUANG Z, et al. Fast and robust neural network joint models for statistical machine translation[C]//Meeting of the Association for Computational Linguistics, 2014:1370-1380.
    [45] ANDREAS J, DAN K. When and why are log-linear models self-normalizing?[C]//Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies, 2015:244-249.
    [46] MIKOLOV T, KOPECKY J, BURGET L, et al. Neural network based language models for highly inflective languages[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2009:4725-4728.
    [47] SANTOS C D, ZADROZNY B. Learning character-level representations for part-of-speech tagging[C]//Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014:1818-1826.
    [48] COTTERELL R, SCHÜTZE H. Morphological word-embeddings[C]//Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies, 2015:1287-1292.
    [49] BOJANOWSKI P, GRAVE E, JOULIN A, et al. Enriching word vectors with subword information[J]. ArXiv preprint arXiv:1607.04606, 2016.
    [50] LI Y, LI W, SUN F, et al. Component-enhanced Chinese character embeddings[C]//Empirical Methods in Natural Language Processing, 2015:829-834.
    [51] YU M, DREDZE M. Improving lexical embeddings with semantic knowledge[C]//Meeting of the Association for Computational Linguistics, 2014:545-550.
    [52] WANG Z, ZHANG J, FENG J, et al. Knowledge graph and text jointly embedding[C]//Conference on Empirical Methods in Natural Language Processing, 2014:1591-1601.
    [53] REISINGER J, MOONEY R J. Multi-prototype vector-space models of word meaning[C]//Human Language Technologies:The 2010 Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2010:109-117.
    [54] HUANG E H, SOCHER R, MANNING C D, et al. Improving word representations via global context and multiple word prototypes[C]//Meeting of the Association for Computational Linguistics:Long Papers. Association for Computational Linguistics, 2012:873-882.
    [55] VILNIS L, MCCALLUM A. Word representations via gaussian embedding[R]. University of Massachusetts Amherst, 2014.
    [56] HILL F, REICHART R, KORHONEN A, et al. Simlex-999:Evaluating semantic models with genuine similarity estimation[J]. Computational Linguistics, 2015, 41(4):665-695. doi:  10.1162/COLI_a_00237
    [57] FINKELSTEIN R L. Placing search in context:the concept revisited[J]. Acm Transactions on Information Systems, 2002, 20(1):116-131. doi:  10.1145/503104.503110
    [58] ZWEIG G, BURGES C J C. The Microsoft Research sentence completion challenge[R]. Technical Report MSRTR-2011-129, Microsoft, 2011.
    [59] GLADKOVA A, DROZD A, MATSUOKA S. Analogy-based detection of morphological and semantic relations with word embeddings:what works and what doesn't[C]//HLT-NAACL, 2016:8-15.
    [60] MIKOLOV T, YIH W, ZWEIG G. Linguistic regularities in continuous space word representations[C]//HLTNAACL, 2013:746-751.
    [61] BAHDANAU D, CHO K, BENGIO Y. Neural machine translation by jointly learning to align and translate[C]//ICLR, 2015:1-15.
    [62] GROVER A, LESKOVEC J. Node2vec:Scalable feature learning for networks[C]//Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016:855-864.
Publication history
  • Received: 2017-05-01
  • Published: 2017-09-25
