Chinese named entity relation extraction for enterprise knowledge graph construction
-
Abstract: An enterprise knowledge graph is a vertical-domain knowledge base built for the financial field to describe business relationships between enterprises. Although a vertical-domain knowledge graph covers less ground than an open knowledge graph, it demands far higher accuracy of its knowledge. As a result, despite the considerable progress open knowledge graphs have made in recent years, they have not been applied in depth in vertical domains, particularly in business, where demand for enterprise knowledge graphs is strong. To address the current limitations of relation extraction for enterprise knowledge graphs, and building on an analysis of existing research on entity relation extraction, this paper proposes a classification-based method for Chinese entity relation extraction. The method uses a maximum entropy model and, through experiments on announcements of listed companies, identifies the optimal feature template for this relation extraction task. Accuracy on the enterprise announcement data set generally exceeds 85%.
-
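Tab. 1 Top ten verbs by frequency

| Verb | Frequency |
| --- | --- |
| 持有 (hold) | 17 553 963 |
| 投资 (invest) | 14 848 084 |
| 发行 (issue) | 11 762 727 |
| 转让 (transfer) | 11 015 165 |
| 简称 (abbreviate as) | 7 799 445 |
| 合并 (merge) | 7 403 418 |
| 签订 (sign) | 6 024 973 |
| 委托 (entrust) | 5 246 879 |
| 购买 (purchase) | 4 969 604 |
| 收购 (acquire) | 4 810 237 |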
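Tab. 2 Relation type definition

| Relation type | Set of verbs describing the relation |
| --- | --- |
| 持有 (hold) | 持有, 合计持有, 应付持有, 合并持有, 转让持有, 拟转让持有, …… |
| 投资 (invest) | 投资, 对外投资, 投资建设, 合作投资, 投资并购, 投资担保, …… |
| 转让 (transfer) | 转让, 挂牌转让, 出资转让, 限制转让, 作价转让, 优先转让, …… |
| 合并 (merge) | 合并, 吸收合并, 进行合并, 购买合并, 合并收购, 收购合并, …… |
| 收买 (acquire) | 收买, 并购, 收购, 拟收购, 要约收购, 收购完成, 出资收购, …… |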
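As a rough illustration of how the verb sets in Tab. 2 might be used, the sketch below maps a trigger verb to its relation type by dictionary lookup. The verb sets are copied from Tab. 2 and are incomplete wherever the table is elided; the names `RELATION_VERBS`, `VERB_TO_RELATION` and `relation_type_of` are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Optional

# Verb sets copied from Tab. 2. The table elides further verbs ("……"),
# so each list here is only the part that the table shows.
RELATION_VERBS: Dict[str, List[str]] = {
    "持有": ["持有", "合计持有", "应付持有", "合并持有", "转让持有", "拟转让持有"],
    "投资": ["投资", "对外投资", "投资建设", "合作投资", "投资并购", "投资担保"],
    "转让": ["转让", "挂牌转让", "出资转让", "限制转让", "作价转让", "优先转让"],
    "合并": ["合并", "吸收合并", "进行合并", "购买合并", "合并收购", "收购合并"],
    "收买": ["收买", "并购", "收购", "拟收购", "要约收购", "收购完成", "出资收购"],
}

# Inverted index: verb -> relation type (hypothetical helper structure).
VERB_TO_RELATION: Dict[str, str] = {
    verb: relation
    for relation, verbs in RELATION_VERBS.items()
    for verb in verbs
}


def relation_type_of(verb: str) -> Optional[str]:
    """Return the relation type whose verb set contains `verb`, or None."""
    return VERB_TO_RELATION.get(verb)


if __name__ == "__main__":
    for verb in ("要约收购", "吸收合并", "签订"):
        print(verb, relation_type_of(verb))
```

An exact-match lookup like this only proposes candidate relation types. In the approach described in the abstract, the final decision is made by a classifier.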
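Tab. 3 Symbol definition

| Symbol | Description |
| --- | --- |
| $X$ | Finite set of word vectors obtained after word segmentation |
| $Y$ | Finite set of output states |
| $x$ | Any word vector |
| $y$ | Any output state |
| $p(y \mid x)$ | Probability of output $y$ given context $x$ |
| $\hat p(y \mid x)$ | Empirical distribution of the probability of output $y$ given context $x$ |
| $p(x, y)$ | Joint probability distribution of context $x$ and output $y$ |
| $\hat p(x, y)$ | Empirical distribution of the joint probability of context $x$ and output $y$ |
| $\hat p(x)$ | Empirical distribution of $x$ in the sample |
| $f(x, y)$ | Feature function describing the relation between input feature $x$ and output state $y$ |
| $E_{p}\langle f_i\rangle$ | Expectation of the feature function with respect to the model $p(y \mid x)$ |
| $E_{\hat p}\langle f_i\rangle$ | Expectation of the feature function with respect to the empirical distribution $\hat p(y \mid x)$ |
| $Z(x)$ | Normalization factor |
| $\lambda_i$ | Weight of feature $i$ |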
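The symbols in Tab. 3 match the standard conditional maximum entropy model, $p(y \mid x) = \frac{1}{Z(x)}\exp\big(\sum_i \lambda_i f_i(x, y)\big)$ with $Z(x) = \sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)$. The sketch below evaluates that standard form. It is not taken from the paper: the function name, the two toy feature functions and the weights are assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence

# A feature function f_i(x, y); x is a tokenized context, y an output state.
FeatureFn = Callable[[Sequence[str], str], float]


def maxent_probabilities(
    x: Sequence[str],
    labels: List[str],
    features: List[FeatureFn],
    weights: List[float],
) -> Dict[str, float]:
    """Return p(y|x) for every y in `labels` under a log-linear model."""
    if len(features) != len(weights):
        raise ValueError("need exactly one weight (lambda_i) per feature")
    # Linear score sum_i lambda_i * f_i(x, y) for each candidate output y.
    scores = {
        y: sum(lam * f(x, y) for f, lam in zip(features, weights))
        for y in labels
    }
    # Subtract the maximum score before exponentiating for numerical
    # stability; the shift cancels when dividing by the normalizer.
    shift = max(scores.values())
    unnormalized = {y: math.exp(s - shift) for y, s in scores.items()}
    z = sum(unnormalized.values())  # Z(x), up to the factor exp(-shift)
    return {y: u / z for y, u in unnormalized.items()}


# Toy binary feature functions (hypothetical, for illustration only).
def f_hold(x: Sequence[str], y: str) -> float:
    return 1.0 if "持有" in x and y == "持有" else 0.0


def f_acquire(x: Sequence[str], y: str) -> float:
    return 1.0 if "收购" in x and y == "收买" else 0.0


if __name__ == "__main__":
    probs = maxent_probabilities(
        ["甲公司", "持有", "乙公司"], ["持有", "收买"],
        [f_hold, f_acquire], [1.5, 2.0],
    )
    print(probs)
```

In practice the weights $\lambda_i$ would be estimated from training data, for example with an iterative scaling procedure, rather than set by hand as in this toy call.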
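Tab. 4 Lexical feature template

| Feature description | Feature representation |
| --- | --- |
| 1-gram word features in the context of the keyword and entities | $w_{i-2}$, $w_{i-1}$, $w_{i}$, $w_{i+1}$, $w_{i+2}$ |
| 1-gram part-of-speech features in the context of the keyword and entities | $t_{i-2}$, $t_{i-1}$, $t_{i}$, $t_{i+1}$, $t_{i+2}$ |
| 2-gram word features in the context of the keyword and entities | $w_{i-2}w_{i-1}$, $w_{i-1}w_{i}$, $w_{i}w_{i+1}$, $w_{i+1}w_{i+2}$ |
| 2-gram part-of-speech features in the context of the keyword and entities | $t_{i-2}t_{i-1}$, $t_{i-1}t_{i}$, $t_{i}t_{i+1}$, $t_{i+1}t_{i+2}$ |
| Length features of the keyword and entities | ${\rm len}_{\rm key}$, ${\rm len}_{e}$ |
| Position features of the keyword and entities | $p_{\rm key}$, $(p_{\rm key}-p_{e1})$, $(p_{e2}-p_{\rm key})$ |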
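Tab. 5 Specific lexical features of the "parent company" relation

| Feature name | Feature description |
| --- | --- |
| $t_{-i}$=:nt | The part of speech of the $i$-th word before the keyword contains nt |
| $t_{+j}$=:nt | The part of speech of the $j$-th word after the keyword contains nt |
| $t_{-1}$=uj | The part of speech of the word immediately before the keyword equals uj |
| $w_{-1}$=的 | The word immediately before the keyword equals "的" |
| $t_{+1}$=n | The part of speech of the word immediately after the keyword equals n |
| $w_{+1}$=名称 | The word immediately after the keyword equals "名称" ("name") |
| $t_{-i}$=:v and $w_{-i}$=:是 | The part of speech of the $i$-th word before the keyword contains v and the word contains "是" ("is") |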
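The features in Tab. 5 can be read as binary tests on the (word, part-of-speech) pairs around the keyword. The sketch below is one possible reading, not the authors' implementation. It assumes that the features indexed by $i$ or $j$ fire when any word before (or after) the keyword satisfies the test, which the table does not state explicitly. The function name and the example sentence are invented for illustration.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def parent_company_features(tokens: List[Token], key_idx: int) -> Dict[str, int]:
    """Binary Tab. 5-style features for the keyword at position `key_idx`."""
    before = tokens[:key_idx]       # words preceding the keyword
    after = tokens[key_idx + 1:]    # words following the keyword
    # Use an empty placeholder when the keyword is at a sentence boundary.
    prev_word, prev_tag = before[-1] if before else ("", "")
    next_word, next_tag = after[0] if after else ("", "")
    return {
        # Assumption: "the i-th word" is read as "some word" on that side.
        "t-i=:nt": int(any("nt" in tag for _, tag in before)),
        "t+j=:nt": int(any("nt" in tag for _, tag in after)),
        "t-1=uj": int(prev_tag == "uj"),
        "w-1=的": int(prev_word == "的"),
        "t+1=n": int(next_tag == "n"),
        "w+1=名称": int(next_word == "名称"),
        "t-i=:v&w-i=:是": int(
            any("v" in tag and "是" in word for word, tag in before)
        ),
    }


if __name__ == "__main__":
    # Invented, pre-tagged example; "母公司" at index 4 plays the keyword.
    sentence = [
        ("A集团", "nt"), ("是", "v"), ("本公司", "n"),
        ("的", "uj"), ("母公司", "n"), ("名称", "n"),
    ]
    print(parent_company_features(sentence, 4))
```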
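Tab. 6 Sentence constituents

| Symbol | Constituent description |
| --- | --- |
| ADJP | Adjective phrase |
| IP | Sentence |
| JJ | Noun modifier |
| NN | Common noun |
| NP | Noun phrase |
| NT | Temporal noun |
| P | Preposition |
| PP | Prepositional phrase |
| PU | Punctuation |
| VC | Copula |
| VP | Verb phrase |
| VRD | Verb-resultative compound |
| VV | Common verb |

Algorithm 1 Obtaining the feature template
Input: the set of training data, Set$\langle$sample$\rangle$.
Output: the set of feature vectors of the training data, List$\langle$featureVector$\rangle$.
put all the entities into ANSJ;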
Result = {};
for each sample in Set$\langle$sample$\rangle$ do
    get the sample label and put it into featureVector;
    get the two entities and the keyword, then get their unigram words within a sliding window of size 2;
    for each unigram word do
        put it into featureVector and set its value to 1;
    end for
    get the unigram part-of-speech features of the entities and keyword within a sliding window of size 2;
    get the bigram word and part-of-speech features of the entities and keyword, also within a sliding window of size 2;
    get the lengths of the entities and keyword;
    get the distances between the entities and the keyword;
    put all these features into featureVector in the same way as the unigram word features;
    add featureVector to Result;
end for
return Result;
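Algorithm 1 and the template in Tab. 4 can be sketched in Python as follows. This is an illustrative reading rather than the authors' code. It assumes each sample is already segmented and tagged (for example by ANSJ) and is given as a dict with the hypothetical keys `label`, `tokens`, `e1`, `key` and `e2`. The feature-name strings and the `<PAD>` boundary token are also assumptions.

```python
from typing import Dict, List, Sequence, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)
PAD = "<PAD>"            # assumed filler for positions outside the sentence


def window(seq: Sequence[str], i: int, size: int = 2) -> List[str]:
    """Items at positions i-size .. i+size, padded at sentence boundaries."""
    return [seq[j] if 0 <= j < len(seq) else PAD
            for j in range(i - size, i + size + 1)]


def extract_features(tokens: List[Token], e1: int, key: int, e2: int) -> Dict[str, float]:
    """Tab. 4-style features for entity positions e1, e2 and keyword position key."""
    words = [w for w, _ in tokens]
    tags = [t for _, t in tokens]
    feats: Dict[str, float] = {}
    for name, idx in (("e1", e1), ("key", key), ("e2", e2)):
        for kind, seq in (("w", words), ("t", tags)):
            ctx = window(seq, idx)
            # 1-gram features at offsets -2..+2, each set to 1.
            for off, item in zip(range(-2, 3), ctx):
                feats[f"{name}:{kind}[{off}]={item}"] = 1.0
            # 2-gram features over adjacent pairs inside the window.
            for off, (a, b) in zip(range(-2, 2), zip(ctx, ctx[1:])):
                feats[f"{name}:{kind}[{off},{off + 1}]={a}_{b}"] = 1.0
        # Length feature: number of characters in the entity or keyword.
        feats[f"len_{name}"] = float(len(words[idx]))
    # Position features: keyword position and its distance to each entity.
    feats["p_key"] = float(key)
    feats["p_key-p_e1"] = float(key - e1)
    feats["p_e2-p_key"] = float(e2 - key)
    return feats


def build_feature_vectors(samples: List[dict]) -> List[Tuple[str, Dict[str, float]]]:
    """Return one (label, sparse feature vector) pair per training sample."""
    result = []
    for s in samples:
        feats = extract_features(s["tokens"], s["e1"], s["key"], s["e2"])
        result.append((s["label"], feats))
    return result


if __name__ == "__main__":
    sample = {
        "label": "持有",
        "tokens": [("甲公司", "nt"), ("持有", "v"), ("乙公司", "nt"), ("股份", "n")],
        "e1": 0, "key": 1, "e2": 2,
    }
    label, feats = build_feature_vectors([sample])[0]
    print(label, len(feats))
```

Sparse dictionaries keyed by feature name are used here because most n-gram features are absent from any given sample. A maximum entropy trainer would typically map these names to integer indices first.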