An algorithm for natural language generation via text extraction
-
Abstract: Natural language generation aims to let machines write as humans do, reducing the workload of human language workers and delivering real-time, concise news coverage to readers. It can be applied to intelligent question answering and dialogue, automatic news writing, incident reporting, and other applications, and has long been an open research problem in both academia and industry. In this paper, we model automatic text generation as a keyword-covering problem and propose an unsupervised extractive algorithm for natural language generation. The algorithm also improves the structure of the generated text, which is no longer a single paragraph. Experiments show that the algorithm is effective on a large-scale corpus: the generated text covers information more comprehensively and is closer in meaning to manually written text.
-
Key words:
- natural language generation /
- keyword covering problem /
- informativeness /
- redundancy
-
Algorithm 1  Text extraction algorithm
Input: keyword sets Key and User-Key; document set $D$; threshold $\alpha$
Output: minimal covering sentence set $C'$
// model and solve the keyword covering problem
1: FOR document $d_i$ in $D$
2:   FOR sentence ${\rm Sen}_j$ in $d_i$
3:     FOR keyword $k_k$ in Key
4:       IF ${\rm Sen}_j$.contains($k_k$) THEN
5:         $Y[j][k]=1$
6:       ELSE
7:         $Y[j][k]=0$
8:       END IF
9:     END FOR
10:   END FOR
11: END FOR
12: compute $Z$ from User-Key in the same way
13: solve the linear program to obtain $X_j$
// automatic sentence extraction
14: IF $X_j>\alpha$ THEN
15:   add ${\rm Sen}_j$ into $C'$
16: END IF
17: RETURN $C'$
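A minimal Python sketch of Algorithm 1, assuming sentences and keywords are plain strings. The paper solves a linear program to obtain the indicators $X_j$; since the LP formulation details are not given in this excerpt, the sketch below substitutes the classical greedy set-cover approximation for that step (a stand-in, not the authors' exact solver), and the function and variable names are illustrative only.

```python
from typing import List, Set

def build_incidence(sentences: List[str], keywords: List[str]) -> List[Set[int]]:
    """The Y matrix of Algorithm 1, stored as sets:
    cover[j] = indices of keywords contained in sentence j."""
    return [{k for k, kw in enumerate(keywords) if kw in sen} for sen in sentences]

def extract_cover(sentences: List[str], keywords: List[str]) -> List[str]:
    """Greedy stand-in for the LP step: repeatedly pick the sentence that
    covers the most still-uncovered keywords (minimum set-cover heuristic)."""
    cover = build_incidence(sentences, keywords)
    uncovered = set().union(*cover) if cover else set()
    chosen: List[int] = []
    while uncovered:
        j = max(range(len(sentences)), key=lambda i: len(cover[i] & uncovered))
        if not cover[j] & uncovered:  # remaining keywords appear in no sentence
            break
        chosen.append(j)
        uncovered -= cover[j]
    return [sentences[j] for j in sorted(chosen)]  # C': covering sentence set

sents = ["the bakery sells fresh bread",
         "cake shops need a business plan",
         "bread and cake are baked daily"]
keys = ["bread", "cake", "plan"]
print(extract_cover(sents, keys))
# → ['the bakery sells fresh bread', 'cake shops need a business plan']
```

The greedy heuristic gives the standard logarithmic approximation guarantee for set cover; an LP relaxation with the threshold $\alpha$, as in the paper, trades that guarantee for fractional indicators that can be tuned per class.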
Tab. 1 Titles of candidate documents in class one
No.  Title
1    An expert shares a fresh tofu cake series blending Chinese and Western styles
2    Experts forecast development trends of the baking industry in the new era
3    Special Women's Day wishes in March: a collection of exclusive cake series
4    How breakthroughs are achieved in the development of the baking industry
5    The successful operation of a Shanghai cake shop, built for you by Shanghai Qingcong Suiyue
6    Shanghai Qingcong bakery helps bread-loving entrepreneurs realize their start-up dreams
7    Deliciousness is right here: letting everyone enjoy good bread
8    A bakery of love: a "vitality station" for retired parents
9    Are many people opening cake shops now, and what preparations should be made beforehand
10   An expert explains the joys and sorrows of cake franchisees
Tab. 2 Edmundson coincidence rate
Algorithm        Class 1          Class 2          Class 3   Class 4
TextRank(BM25)   ${\bf 0.0606}$   0.0              0.0192    0.0093
TextRank(Sim)    0.0303           0.0              0.0192    0.0093
Summary          0.0455           ${\bf 0.0213}$   ${\bf 0.0288}$   0.0093
(Bold marks the best score in each class.)
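The coincidence rate in Tab. 2 follows Edmundson's idea of comparing machine-extracted sentences against a human-produced extract [16]. The exact computation is not spelled out in this excerpt, so the sketch below assumes it is the fraction of generated sentences that also appear in the human extract; the function name and this definition are assumptions.

```python
def coincidence_rate(generated, reference):
    """Assumed definition: fraction of generated sentences that coincide
    with the human-produced extract (in the spirit of Edmundson, 1969)."""
    if not generated:
        return 0.0
    ref = set(reference)
    return sum(1 for s in generated if s in ref) / len(generated)

# toy check: one of two generated sentences matches the human extract
print(coincidence_rate(["a b c", "d e f"], ["a b c", "x y z"]))  # → 0.5
```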
Tab. 3 ROUGE-N scores of TextRank-generated vs. Summary-generated text
Metric   TextRank(BM25)           TextRank(Sim)            Summary
         $C$1  $C$2  $C$3  $C$4   $C$1  $C$2  $C$3  $C$4   $C$1  $C$2  $C$3  $C$4
R-1      0.42  0.27  0.30  0.23   0.43  0.30  0.28  0.28   0.48  0.38  0.35  0.30
R-2      0.28  0.18  0.18  0.17   0.28  0.19  0.19  0.19   0.34  0.18  0.18  0.16
R-S4     0.19  0.13  0.14  0.10   0.20  0.14  0.12  0.12   0.32  0.15  0.17  0.13
R-SU4    0.21  0.14  0.15  0.12   0.22  0.15  0.14  0.14   0.32  0.15  0.18  0.15
-
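The R-1 and R-2 rows in Tab. 3 are ROUGE-N recall scores [17]. A minimal sketch of ROUGE-N, assuming whitespace tokenization and the standard clipped n-gram counts (skip-gram variants R-S4/R-SU4 are omitted for brevity):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate: str, reference: str, n: int) -> float:
    """ROUGE-N recall: clipped candidate n-gram matches over reference n-gram count."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(cnt, cand[g]) for g, cnt in ref.items())
    return overlap / total

print(rouge_n_recall("the cat sat on the mat", "the cat is on the mat", 1))  # ≈ 0.833
```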
[1] WAN X J. Research progress and trends in automatic text generation[R]. Beijing: Peking University, 2016: 1-2.
[2] ZHANG Y, KRIEGER H U. Large-scale corpus-driven PCFG approximation of an HPSG[C]//Proceedings of the 12th International Conference on Parsing Technologies. Stroudsburg: Association for Computational Linguistics, 2011: 198-208.
[3] SRIPADA S, REITER E, DAVY I. SumTime-Mousam: Configurable marine weather forecast generator[J]. Expert Update, 2003, 6(3): 4-10.
[4] KUKICH K. Design of a knowledge-based report generator[C]//Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 1983: 145-150.
[5] PORTET F, REITER E, GATT A, et al. Automatic generation of textual summaries from neonatal intensive care data[J]. Artificial Intelligence, 2009, 173(7/8): 789-816.
[6] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2014, 39(4): 664-676.
[7] LI S J, OUYANG Y, WANG W, et al. Multi-document summarization using support vector regression[C/OL]//Proceedings of the Document Understanding Conference. [2017-05-03]. http://www-nlpir.nist.gov/projects/duc/pubs/2007papers/pekingu.final.pdf.
[8] KNIGHT K, MARCU D. Statistics-based summarization — step one: Sentence compression[C]//Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence. [S.l.]: AAAI Press, 2000: 703-710.
[9] CLARKE J, LAPATA M. Global inference for sentence compression: An integer linear programming approach[J]. Journal of Artificial Intelligence Research, 2008, 31: 399-429. doi: 10.1613/jair.2433
[10] FILIPPOVA K. Multi-sentence compression: Finding shortest paths in word graphs[C]//Proceedings of the 23rd International Conference on Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2010: 322-330.
[11] THADANI K, MCKEOWN K. Supervised sentence fusion with single-stage inference[C]//Proceedings of the International Joint Conference on Natural Language Processing. 2013: 1410-1418.
[12] FUJITA A, INUI K, MATSUMOTO Y. Exploiting Lexical Conceptual Structure for paraphrase generation[C]//International Conference on Natural Language Processing. Berlin: Springer, 2005: 908-919. doi: 10.1007/11562214_79
[13] DUBOUE P A, CHU-CARROLL J. Answering the question you wish they had asked: The impact of paraphrasing for question answering[C]//Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers. Stroudsburg: Association for Computational Linguistics, 2006: 33-36.
[14] BLEI D M, NG A Y, JORDAN M I. Latent Dirichlet allocation[J]. Journal of Machine Learning Research, 2003, 3: 993-1022.
[15] MIHALCEA R, TARAU P. TextRank: Bringing order into texts[C]//Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2004: 404-411.
[16] EDMUNDSON H P. New methods in automatic extracting[J]. Journal of the ACM, 1969, 16(2): 264-285. doi: 10.1145/321510.321519
[17] LIN C Y. ROUGE: A package for automatic evaluation of summaries[C/OL]//Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. [2017-05-03]. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/07/was2004.pdf.
[18] PARVEEN D, MESGAR M, STRUBE M. Generating coherent summaries of scientific articles using coherence patterns[C]//Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016: 772-783.