

An algorithm for natural language generation via text extracting

AI Li-si, TANG Wei-hong, FU Yun-bin, DONG Qi-min, ZHENG Jian-bing, GAO Ming

Citation: AI Li-si, TANG Wei-hong, FU Yun-bin, DONG Qi-min, ZHENG Jian-bing, GAO Ming. An algorithm for natural language generation via text extracting[J]. Journal of East China Normal University (Natural Sciences), 2018, (4): 70-79. doi: 10.3969/j.issn.1000-5641.2018.04.007

doi: 10.3969/j.issn.1000-5641.2018.04.007
Funds:

National Key R&D Program of China 2016YFB1000905

Joint Key Project of the National Natural Science Foundation of China and Guangdong Province U1401256

National Natural Science Foundation of China 61402177

National Natural Science Foundation of China 61672234

National Natural Science Foundation of China 61402180

National Natural Science Foundation of China 61363005

National Natural Science Foundation of China 61472321

More Information
    About the authors:

    AI Li-si, female, master's degree candidate; research interest: natural language processing. E-mail: irisinsh@163.com

    Corresponding author:

    DONG Qi-min, male, first-rank secondary school teacher; research interest: information processing technology. E-mail: 418976195@qq.com

  • CLC number: TP391

An algorithm for natural language generation via text extracting

  • Abstract: Automatic text generation aims to enable machines to write as humans do, reducing the workload of language professionals and delivering timely, concise news reports to readers. It can be applied to intelligent question answering and dialogue, automatic news writing, breaking-news reporting, and similar applications, and it has long been a research problem that both academia and industry seek to advance. This paper models automatic text generation as a keyword set-cover problem and proposes an unsupervised extractive algorithm for automatic text generation. The algorithm improves the structure of the generated text, so that the output is no longer a single undivided paragraph. Experiments show that the algorithm performs well on a large-scale corpus: the generated text covers information more comprehensively and is closer in meaning to human-written text.
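    The abstract states the modeling idea only in words. Under the notation of Algorithm 1 below, where $Y[j][k]$ indicates whether sentence ${\rm Sen}_j$ contains keyword $k$ and $X_j$ is the selection variable later thresholded by $\alpha$, a standard set-cover formulation consistent with that listing would be

    $$\min_{X}\ \sum_{j} X_j \quad \text{s.t.} \quad \sum_{j} Y[j][k]\, X_j \ \ge\ 1 \ \ \text{for every keyword } k \in \text{Key}, \qquad 0 \le X_j \le 1,$$

    whose LP relaxation is solved (line 13 of Algorithm 1) and whose sentences with $X_j > \alpha$ form the extracted set $C'$; an analogous constraint built from $Z$ would handle User-Key. This formulation is inferred from the pseudocode rather than quoted from the paper.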
  • Fig. 1  New text generation

    Fig. 2  System summarization of the candidate documents in class one

    Algorithm 1  Text extraction algorithm
    Input: keyword sets Key and User-Key; document set $D$; threshold $\alpha$
    Output: minimum covering sentence set $C'$
    // Model and solve the keyword-coverage problem
    1: FOR document $d_i$ in $D$
    2:  FOR sentence ${\rm Sen}_j$ in $d_i$
    3:   FOR keyword $k_k$ in Key
    4:    IF ${\rm Sen}_j$ contains $k_k$ THEN
    5:     $Y[j][k]=1$
    6:    ELSE
    7:     $Y[j][k]=0$
    8:    END IF
    9:   END FOR
    10:  END FOR
    11: END FOR
    12: Compute $Z$ from User-Key in the same way
    13: Solve the linear program to obtain $X_j$
    // Automatic sentence extraction
    14: FOR each sentence ${\rm Sen}_j$
    15:  IF $X_j > \alpha$ THEN add ${\rm Sen}_j$ to $C'$
    16: END FOR
    17: RETURN $C'$
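    The excerpt does not give a reference implementation or name an LP solver, so the following Python sketch is only one plausible realization of Algorithm 1: it builds the sentence-keyword incidence matrix $Y$, solves the LP relaxation of the covering problem with scipy.optimize.linprog, and keeps sentences whose relaxed indicator exceeds $\alpha$. The User-Key matrix $Z$ is omitted for brevity, and all names (extract_sentences, the toy documents) are illustrative, not taken from the paper.

    # Illustrative sketch only; not the authors' implementation.
    import numpy as np
    from scipy.optimize import linprog

    def extract_sentences(sentences, keywords, alpha=0.5):
        """Select a small sentence set whose union covers every keyword (LP relaxation of set cover)."""
        # Y[j][k] = 1 if sentence j contains keyword k (lines 1-11 of Algorithm 1).
        Y = np.array([[1.0 if k in s else 0.0 for k in keywords] for s in sentences])
        Y = Y[:, Y.sum(axis=0) > 0]  # drop keywords no sentence covers, so the LP stays feasible

        n = len(sentences)
        # minimize sum_j X_j  subject to  sum_j Y[j][k] * X_j >= 1 for every keyword k, 0 <= X_j <= 1.
        res = linprog(c=np.ones(n),
                      A_ub=-Y.T, b_ub=-np.ones(Y.shape[1]),  # Y^T X >= 1 rewritten as -Y^T X <= -1
                      bounds=[(0.0, 1.0)] * n, method="highs")

        # Lines 13-15 of Algorithm 1: threshold the relaxed indicators to extract sentences into C'.
        return [s for s, x in zip(sentences, res.x) if x > alpha]

    # Hypothetical usage on two toy documents split into sentences.
    docs = ["蛋糕店的经营需要选址和定位。蛋糕和面包都离不开烘焙技术。", "专家分享了烘焙行业的发展趋势。"]
    sentences = [s for d in docs for s in d.split("。") if s]
    print(extract_sentences(sentences, ["蛋糕", "面包", "烘焙"]))  # the single sentence covering all keywords

    With this data the LP selects only the sentence that contains all three keywords, illustrating how the covering objective prefers a small, information-dense extract.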

    Tab. 1  Titles of the candidate documents in class one

    No.  Title
    1    Experts share a fresh tofu cake series blending Chinese and Western styles
    2    Experts forecast the development trends of the baking industry in the new era
    3    Special Women's Day wishes in March: a collection from the exclusive cake series
    4    How the baking industry can break through and surpass itself in its development
    5    The successful operation of a Shanghai cake shop, built for you by 上海青葱岁月
    6    The Shanghai 青葱 bakery helps ambitious bread lovers realize their dream of starting a business
    7    Deliciousness is right here: letting everyone enjoy good bread together
    8    A bakery of love: a "vitality recharge station" for retired parents
    9    Are many people opening cake shops now, and what preparations should be made beforehand?
    10   Experts explain in detail the joys and sorrows of cake-shop franchisees

    Tab. 2  Edmundson coincidence rate

    Algorithm        Class 1            Class 2            Class 3            Class 4
    TextRank(BM25)   ${\bf 0.060 6}$    0.0                0.019 2            0.009 3
    TextRank(Sim)    0.030 3            0.0                0.019 2            0.009 3
    Summary          0.045 5            ${\bf 0.021 3}$    ${\bf 0.028 8}$    0.009 3

    (Bold marks the best value within each class.)
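    The excerpt does not define the coincidence rate precisely. One common Edmundson-style reading is the fraction of sentences shared between the machine extract and a human-produced extract; the short sketch below computes it under that assumption (function and argument names are illustrative).

    # Assumed interpretation of Table 2's metric: overlap between machine-extracted
    # and human-extracted sentence sets, normalized by the size of the human extract.
    def coincidence_rate(system_sents, human_sents):
        system, human = set(system_sents), set(human_sents)
        return len(system & human) / max(len(human), 1)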

    Tab. 3  ROUGE scores of the texts generated by the TextRank and Summary models (TextRank vs. Summary in ROUGE-N)

             TextRank(BM25)              TextRank(Sim)               Summary
             C1    C2    C3    C4        C1    C2    C3    C4        C1    C2    C3    C4
    R-1      0.42  0.27  0.30  0.23      0.43  0.30  0.28  0.28      0.48  0.38  0.35  0.30
    R-2      0.28  0.18  0.18  0.17      0.28  0.19  0.19  0.19      0.34  0.18  0.18  0.16
    R-S4     0.19  0.13  0.14  0.10      0.20  0.14  0.12  0.12      0.32  0.15  0.17  0.13
    R-SU4    0.21  0.14  0.15  0.12      0.22  0.15  0.14  0.14      0.32  0.15  0.18  0.15

    (C1-C4 denote document classes one to four.)
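    ROUGE-N in Table 3 is the standard n-gram recall measure, but the computation details are not shown in this excerpt. The sketch below illustrates ROUGE-1 recall only, under the assumption of character-level unigrams for Chinese text; the example strings are hypothetical.

    # Minimal ROUGE-1 recall sketch (assumption: character unigrams for Chinese text).
    from collections import Counter

    def rouge_1_recall(candidate: str, reference: str) -> float:
        """Overlapping unigram count divided by the number of unigrams in the reference."""
        cand, ref = Counter(candidate), Counter(reference)
        overlap = sum(min(cnt, cand[tok]) for tok, cnt in ref.items())
        return overlap / max(sum(ref.values()), 1)

    print(rouge_1_recall("专家分享豆腐蛋糕", "专家分享了清新豆腐蛋糕"))  # 8/11 ≈ 0.73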
Publication history
  • Received: 2017-06-19
  • Published: 2018-07-25
