Automatic extraction of corporate bankruptcy-related events from ruling documents
Abstract: This paper proposes a framework that detects corporate bankruptcy-related legal events in ruling documents and other case files and extracts structured information about each event. Using ruling documents obtained from courts, the framework first applies distant supervision to build training data. It then uses named entity recognition to perform sequence labeling on the documents at the sentence level. Finally, it combines a self-defined list of event trigger words with an event dictionary to identify bankruptcy-related events and output their structured information. Experiments show that the framework identifies events reliably (average precision 0.829, recall 0.859, and F1 0.845 over labor and loan dispute events, Tab. 4), which demonstrates its effectiveness.
Key words: enterprise bankruptcy; named entity recognition; event extraction
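The three stages named in the abstract (distant supervision, sequence labeling, trigger-based event extraction) can be illustrated with a minimal sketch. It is not the authors' implementation: the entries in `KNOWN_ENTITIES` and `TRIGGERS` are hypothetical, whitespace-separated English tokens stand in for Chinese text, and the dictionary-matched BIO tags stand in for the output of a trained named entity recognition model. Only the entity types (PER, ORG) and the event labels (LB1, LB4) are borrowed from Tab. 2 and Tab. 3.

```python
# Hypothetical knowledge base: entity name -> entity type (an assumption for illustration).
KNOWN_ENTITIES = {
    "Acme Textile Co.": "ORG",
    "Zhang Wei": "PER",
}

# Hypothetical trigger lexicon: trigger phrase -> event label (labels follow Tab. 3).
TRIGGERS = {
    "terminate the labor relationship": "LB1",
    "wage arrears": "LB4",
}


def distant_label(tokens, known_entities):
    """Distant supervision: BIO-tag tokens by matching known entity names."""
    tags = ["O"] * len(tokens)
    # Match longer names first so they are not split by shorter ones.
    entries = sorted(
        ((name.split(), etype) for name, etype in known_entities.items()),
        key=lambda entry: len(entry[0]),
        reverse=True,
    )
    for name_tokens, etype in entries:
        n = len(name_tokens)
        for i in range(len(tokens) - n + 1):
            untagged = all(tag == "O" for tag in tags[i:i + n])
            if tokens[i:i + n] == name_tokens and untagged:
                tags[i] = f"B-{etype}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{etype}"
    return tags


def entities_from_tags(tokens, tags):
    """Collect (text, type) entity spans from a BIO tag sequence."""
    spans, current, ctype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(token)
        else:
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        spans.append((" ".join(current), ctype))
    return spans


def extract_events(sentence, tags, triggers):
    """Emit one structured record per trigger phrase found in the sentence."""
    entities = entities_from_tags(sentence.split(), tags)
    lowered = sentence.lower()
    return [
        {"event": label, "trigger": phrase, "arguments": entities}
        for phrase, label in triggers.items()
        if phrase in lowered
    ]
```

For a sentence such as "Zhang Wei asked Acme Textile Co. to settle wage arrears", `distant_label` tags the two known names and `extract_events` returns a single LB4 record with both entities as arguments. In the framework described above, the automatically tagged sentences would serve as training data for a sequence labeling model rather than being passed on directly.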
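Tab. 1 Statistics of automatically labeled data via distant supervision

| Dataset | Distant supervision | Positive examples | Negative examples |
| --- | --- | --- | --- |
| Loan disputes | 1 956 | 1 980 | 3 144 |
| Labor disputes | 1 523 | 1 565 | 4 142 |
| Total | 3 507 | 3 545 | 7 286 |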
Tab. 2 Precision, recall, and F1 score of named entity recognition

| Entity type | Precision | Recall | F1 |
| --- | --- | --- | --- |
| LOC | 0.557 | 0.796 | 0.655 |
| TIME | 0.599 | 0.917 | 0.725 |
| PER | 0.825 | 0.803 | 0.814 |
| ORG | 0.839 | 0.797 | 0.818 |
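Tab. 3 The 20 sub-events of the labor dispute event

| Labor dispute sub-event | Label | Labor dispute sub-event | Label |
| --- | --- | --- | --- |
| Termination of labor relationship | LB1 | Court rejects arbitration application | LB11 |
| Payment of wages | LB2 | Release of guarantee | LB12 |
| Economic compensation | LB3 | Production halt, layoffs | LB13 |
| Wage arrears | LB4 | Bonus | LB14 |
| Existence/confirmation of labor relationship | LB5 | Work uniform fee | LB15 |
| Double-wage difference | LB6 | Job position | LB16 |
| Labor contract term | LB7 | Death pension benefits | LB17 |
| Paid leave, overtime | LB8 | Contract termination notice | LB18 |
| No labor contract signed | LB9 | Revocation of business license | LB19 |
| Work injury, subsidy | LB10 | Settlement/mediation agreement reached | LB20 |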
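Tab. 4 Precision, recall, and F1 score of event extraction

| Event type | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Labor disputes | 0.843 | 0.872 | 0.859 |
| Loan disputes | 0.815 | 0.847 | 0.831 |
| Average | 0.829 | 0.859 | 0.845 |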
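The F1 columns in Tab. 2 and Tab. 4 are the harmonic mean of precision and recall. The short sketch below recomputes F1 for a few rows from the published precision and recall; the function name `f1_score` is ours. Because the table entries are rounded to three decimals, recomputed values for other rows may differ from the reported F1 in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from Tab. 2 and Tab. 4.
reported = {
    "LOC": (0.557, 0.796),
    "TIME": (0.599, 0.917),
    "Loan disputes": (0.815, 0.847),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
# LOC: F1 = 0.655
# TIME: F1 = 0.725
# Loan disputes: F1 = 0.831
```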