Forward stagewise additive modeling for entity ranking in documents

WANG Yan-hua (王燕华)

WANG Yan-hua. Forward stagewise additive modeling for entity ranking in documents[J]. Journal of East China Normal University (Natural Sciences), 2018, (1): 91-102, 145. doi: 10.3969/j.issn.1000-5641.2018.01.009


doi: 10.3969/j.issn.1000-5641.2018.01.009
Funding:

Shanghai Agricultural Science and Technology Promotion Project (No. 2015-3-2)

Details
    About the author:

    WANG Yan-hua, male, master's degree candidate; research interest: machine learning. E-mail: yhwang917@gmail.com

  • CLC number: P311

Forward stagewise additive modeling for entity ranking in documents

  • Abstract: The key entities of a document summarize the subject of the event (or topic) the text describes, and they support research on entity-oriented retrieval, question answering systems, and related areas. However, the entities in a document are unordered, so ranking them is particularly important. We extract entity features from the text, introduce external features via Wikipedia and distributed word representations, and propose a ranking model, LA-FSAM (FSAM based on the AUC metric and the logistic function), built on the forward stagewise algorithm (Forward Stagewise Algorithm, FSAM). The model constructs its loss function from the Area Under the Curve (AUC) criterion, aggregates entity features with the logistic function, and solves for the model parameters by stochastic gradient descent. Experimental comparisons between LA-FSAM and baseline methods demonstrate the effectiveness of the proposed approach.
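The two ingredients the abstract names, a logistic scoring function over entity features and an AUC-style pairwise loss, can be sketched as follows. The two-dimensional feature layout and the smooth `log(1 + exp(-margin))` surrogate for the 0/1 misranking indicator are illustrative assumptions, not the paper's exact formulation.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function used to squash an entity's combined features."""
    return 1.0 / (1.0 + math.exp(-z))

def entity_score(x, beta):
    """Score one entity: logistic of the inner product of its feature
    vector x with the weight vector beta (both plain lists of floats)."""
    return logistic(sum(xi * bi for xi, bi in zip(x, beta)))

def auc_pair_loss(positives, negatives, beta):
    """AUC-style pairwise loss over one document: every
    (key entity, non-key entity) pair in which the non-key entity is
    not scored clearly below the key one is penalized, using
    log(1 + exp(-margin)) as a smooth surrogate for the 0/1 indicator."""
    total = 0.0
    for p in positives:
        for n in negatives:
            margin = entity_score(p, beta) - entity_score(n, beta)
            total += math.log(1.0 + math.exp(-margin))
    return total / (len(positives) * len(negatives))
```

Weights that rank key entities above non-key ones yield a strictly smaller loss than weights that invert the ranking, which is what the AUC criterion rewards.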
  • Fig. 1 Effect of the random jump probability $\gamma$ on LA-FSAM

    Fig. 2 Effect of the number of basis functions on the average loss and average AUC

    Fig. 3 Effect of different features on LA-FSAM

    Algorithm 1  The LA-FSAM algorithm
    Input: labeled document set $DL=\{dl_1, dl_2, \cdots, dl_n\}$, loss function $L(P, N, f(X))$, set of basis functions $\{b(X;\overrightarrow{\beta})\}$
    Output: additive model $f(X)$
    1. Initialize $f_0(X)=0$
    2. For $t=1, \cdots, T$:
       $(\alpha_t, \overrightarrow{\beta_t})=\mathop{\arg\min}\limits_{\alpha, \overrightarrow{\beta}} \sum\limits_{i=1}^n L(P_i, N_i, f_{t-1}(X_i)+\alpha b(X_i;\overrightarrow{\beta}))$
       $f_t(X)=f_{t-1}(X)+\alpha_t b(X;\overrightarrow{\beta_t})$
    3. Output the additive model
       $f(X)=\sum\limits_{t=1}^T \alpha_t b(X;\overrightarrow{\beta_t})$
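Algorithm 1's greedy loop can be sketched in executable form. A coarse grid search stands in for the per-stage arg-min, logistic basis functions play the role of $b(X;\overrightarrow{\beta})$, and the candidate grids and two-dimensional toy features are made up for illustration only.

```python
import itertools
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def basis(x, beta):
    """Basis function b(X; beta): logistic of the inner product."""
    return logistic(sum(xi * bi for xi, bi in zip(x, beta)))

def model_score(x, terms):
    """Additive model f(X) = sum_t alpha_t * b(X; beta_t)."""
    return sum(alpha * basis(x, beta) for alpha, beta in terms)

def pair_loss(positives, negatives, terms):
    """Pairwise surrogate loss L(P, N, f(X)) over one document."""
    total = 0.0
    for p in positives:
        for n in negatives:
            margin = model_score(p, terms) - model_score(n, terms)
            total += math.log(1.0 + math.exp(-margin))
    return total

def fit_forward_stagewise(docs, T=3, alpha_grid=(0.5, 1.0),
                          beta_grid=((1.0, -1.0), (-1.0, 1.0), (1.0, 0.0))):
    """Greedy forward stagewise fit: at each of T stages, add the single
    (alpha, beta) term that most reduces the total loss of the model
    built so far, leaving all earlier terms untouched."""
    terms = []
    for _ in range(T):
        best_loss, best_term = None, None
        for alpha, beta in itertools.product(alpha_grid, beta_grid):
            candidate = terms + [(alpha, beta)]
            loss = sum(pair_loss(P, N, candidate) for P, N in docs)
            if best_loss is None or loss < best_loss:
                best_loss, best_term = loss, (alpha, beta)
        terms.append(best_term)
    return terms
```

On a toy document whose key entity has a high first feature, the fitted model ranks it above the non-key entity; the defining trait of forward stagewise fitting is that each stage optimizes only the newly added term.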

    Algorithm 2  Learning algorithm for the LA-FSAM model
    Input: labeled document set $DL=\{dl_1, dl_2, \cdots, dl_n\}$, learning rate $lr$
    Output: $\alpha$ and $\overrightarrow{\beta}$
    1. $j=0$
    2. Initialize $\alpha^{(0)}$, $\overrightarrow{\beta}^{(0)}$
    3. While $J(\alpha, \overrightarrow{\beta})$ has not converged:
         For each document $dl_i$ in the training set:
           $\alpha^{(j+1)}=\alpha^{(j)}-lr\frac{\partial J_i(\alpha^{(j)}, \overrightarrow{\beta}^{(j)})}{\partial \alpha}$
           $\overrightarrow{\beta}^{(j+1)}=\overrightarrow{\beta}^{(j)}-lr\frac{\partial J_i(\alpha^{(j)}, \overrightarrow{\beta}^{(j)})}{\partial \overrightarrow{\beta}}$
         $j=j+1$
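Algorithm 2's per-document update can be sketched generically. A central finite-difference gradient stands in for the analytic partials $\partial J_i/\partial\alpha$ and $\partial J_i/\partial\overrightarrow{\beta}$, which the paper derives but this page does not reproduce, and a fixed step budget replaces the convergence check.

```python
def numeric_grad(J, params, eps=1e-6):
    """Central-difference approximation of the gradient of J at params;
    a stand-in for the analytic partial derivatives in Algorithm 2."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        grad.append((J(up) - J(down)) / (2.0 * eps))
    return grad

def sgd_step(J, params, lr=0.1):
    """One stochastic gradient descent update on a single document's loss J_i."""
    return [p - lr * g for p, g in zip(params, numeric_grad(J, params))]

def sgd(J, params, lr=0.1, steps=100):
    """Repeat the update for a fixed step budget (convergence check elided)."""
    for _ in range(steps):
        params = sgd_step(J, params, lr)
    return params
```

On a smooth convex loss such as a shifted quadratic, the iterates converge to the minimizer, which is the behavior Algorithm 2 relies on for its per-document updates.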

    Tab. 1  Distribution of the number of key entities per document

    Number of entities  Number of documents  Proportion of documents/%
    $\geq 1$  500  100
    $\geq 2$  487  97.4
    $\geq 3$  389  77.8
    $\geq 4$  201  40.2
    $\geq 5$  89  17.8

    Tab. 2  Comparison of average precision between LA-FSAM and other methods

    Method Avg.P@1/% Avg.P@2/% Avg.P@3/% Avg.P@4/% Avg.AUC/%
    Frequency 86.4 82.8 81.1 78.6 69.2
    Position 92.2 82.5 72.2 63.3 86.2
    TextRank 83.5 76.3 67.5 69.0 65.5
    WikiRank 68.0 62.6 61.4 68.5 54.5
    WordEmbeddingRank 55.3 52.0 57.0 61.9 61.9
    RankSVM 92.2 85.9 84.6 77.3 87.6
    LA-FSAM 94.2 88.9 85.5 79.8 88.3
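The two metrics reported in Tab. 2, precision at k (P@k) and AUC, have standard definitions that can be computed directly; the ranked-label and score-list layouts below are assumptions about how the evaluation data would be stored, not the paper's code.

```python
def precision_at_k(ranked_labels, k):
    """P@k: fraction of the top-k ranked entities that are truly key
    entities; ranked_labels holds 1 (key) / 0 (non-key) in rank order."""
    return sum(ranked_labels[:k]) / float(k)

def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) score pairs ranked
    correctly, counting ties as half correct."""
    correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))
```

This pairwise-counting form of AUC is exactly the quantity the LA-FSAM loss optimizes a smooth surrogate of, which is why Avg.AUC is a natural column in Tab. 2.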
Publication history
  • Received: 2016-12-01
  • Published: 2018-01-25
