Fraudulent medical behavior detection based on hybrid approach
Abstract: As the medical insurance system improves and its coverage expands, the normal operation of medical insurance funds has become closely tied to the vital interests of the public. However, fraudulent behaviors such as frequent hospitalization, hospitalization decomposition, and abnormal fees occur often and seriously threaten the normal operation of these funds. This paper first uses the random forest method to select features separately for each disease category. It then applies a method based on the Clustering-Based Local Outlier Factor (CBLOF) and an improved CBLOF method to detect abnormal settlement fees. In addition, a rule-based method is used to identify frequent hospitalization and hospitalization decomposition. Experiments on real medical insurance settlement data demonstrate the feasibility and effectiveness of the proposed methods. Finally, the paper presents the system framework of a medical insurance fund supervision platform, which visualizes the results of pivot analysis with the help of Echarts.
Key words:
- outlier detection
- local outlier factor
- CBLOF
- hospitalization decomposition
Algorithm 1. CBLOF-based anomaly detection
Input: hospitalization records $D$; cluster-size threshold $\alpha$; cluster set cluster_set; maximum-similarity threshold max_sim between a record and a cluster
Output: the list of records contained in each cluster; the lof value of each record

initialize cluster_set
for record in $D$ do
    if record is the first record then
        addNewCluster(record)
    else
        for cluster in cluster_set do
            computeSimilarity(record, cluster)
        store the largest similarity in cur_similarity and the index of the corresponding cluster in index
        if cur_similarity $>$ max_sim then
            addToCluster(record, index)
        else
            addNewCluster(record)
for cluster in cluster_set do
    if the size of cluster exceeds $\alpha \cdot |D|$ then
        label cluster as $L$ (large)
    else
        label cluster as $S$ (small)
for record in $D$ do
    if record belongs to $c_i$ and $c_i$ is labeled $S$ then
        lof(record) = computeLOF_S(record, cluster_set)
    else
        lof(record) = computeLOF_L(record, $c_i$)

Algorithm 2. Improved CBLOF-based anomaly detection
Input: hospitalization records ${D}'$; LOF-update threshold $\beta$ for large clusters; cluster set cluster_set
Output: the lof value of each new record

for record in ${D}'$ do
    cluster record with the Squeezer method; let $c_i$ be the resulting cluster and index its cluster index
    if $c_i$ is labeled $S$ then
        addToCluster(record, index)
        for line in $c_i$ do
            lof(line) = updateLOF_S(line, index)
    else
        $cnt$ = $cnt$ + 1
        lof(record) = computeLOF_L(record, index)
        if $\dfrac{cnt}{|c_i|} > \beta$ then
            for line in $c_i$ do
                lof(line) = updateLOF_L(line, index)
            $cnt$ = 0

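The clustering and scoring steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the source does not define computeSimilarity, computeLOF_S, or computeLOF_L, so the sketch assumes a Squeezer-style similarity for categorical records (the share of cluster members agreeing with the record on each attribute, summed over attributes) and the usual CBLOF scoring (cluster size times the similarity to the record's own cluster for large clusters, or to the most similar large cluster for small ones). All function and field names are hypothetical.

```python
from collections import Counter


def similarity(record, cluster):
    """Assumed Squeezer-style similarity: for each attribute, the fraction of
    cluster members sharing the record's value, summed over attributes."""
    size = len(cluster["members"])
    return sum(cluster["counts"][i][value] / size
               for i, value in enumerate(record))


def add_new_cluster(cluster_set, idx, record):
    # One value counter per attribute summarises the cluster.
    cluster_set.append({"members": [idx],
                        "counts": [Counter([v]) for v in record]})


def add_to_cluster(cluster, idx, record):
    cluster["members"].append(idx)
    for i, value in enumerate(record):
        cluster["counts"][i][value] += 1


def cblof(records, alpha, max_sim):
    """Single-pass clustering followed by CBLOF-style scoring.

    Returns (cluster_set, scores). Because the score is built from a
    similarity, LOWER scores mark MORE outlying records (an assumption).
    """
    cluster_set = []
    for idx, record in enumerate(records):
        if not cluster_set:                      # first record
            add_new_cluster(cluster_set, idx, record)
            continue
        sims = [similarity(record, c) for c in cluster_set]
        index = max(range(len(sims)), key=sims.__getitem__)
        if sims[index] > max_sim:
            add_to_cluster(cluster_set[index], idx, record)
        else:
            add_new_cluster(cluster_set, idx, record)

    # Label clusters: large if size exceeds alpha * |D|, small otherwise.
    threshold = alpha * len(records)
    large = [c for c in cluster_set if len(c["members"]) > threshold]

    scores = {}
    for cluster in cluster_set:
        is_large = len(cluster["members"]) > threshold
        for idx in cluster["members"]:
            record = records[idx]
            if is_large or not large:
                s = similarity(record, cluster)
            else:
                # Small cluster: compare with the most similar large cluster.
                s = max(similarity(record, c) for c in large)
            scores[idx] = len(cluster["members"]) * s
    return cluster_set, scores
```

Ranking records by ascending score and inspecting the first few percent gives a candidate list of abnormal settlements; the values of alpha and max_sim would need tuning on real data.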
Tab. 1 Examples of hospitalization decomposition
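| Person ID | Hospital ID | Admission diagnosis | Admission date | Discharge date |
| --- | --- | --- | --- | --- |
| 057FCF30903 | 018FF7841008E | Lumbar disc herniation | 20150611 | 20150617 |
| 057FCF30903 | 018FF7841008E | Lumbar disc herniation | 20150617 | 20150624 |
| 057FCF30903 | 018FF7841008E | Lumbar disc herniation | 20150624 | 20150701 |
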
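The pattern in Tab. 1 (the same person readmitted to the same hospital for the same diagnosis on the day of the previous discharge) suggests a simple rule. The sketch below is one possible reading of such a rule, not the rule used in the paper, which this section does not spell out. The field names and the `max_gap_days` parameter are hypothetical.

```python
from datetime import datetime


def _parse(date_text):
    # Dates follow the YYYYMMDD layout shown in Tab. 1.
    return datetime.strptime(date_text, "%Y%m%d")


def find_decomposed_stays(stays, max_gap_days=0):
    """Return (previous, current) pairs of stays where the same person is
    readmitted to the same hospital with the same diagnosis within
    max_gap_days of the previous discharge (assumed rule)."""
    groups = {}
    for stay in stays:
        key = (stay["person_id"], stay["hospital_id"], stay["diagnosis"])
        groups.setdefault(key, []).append(stay)

    flagged = []
    for group in groups.values():
        # YYYYMMDD strings sort chronologically.
        group.sort(key=lambda s: s["admission"])
        for prev, cur in zip(group, group[1:]):
            gap = (_parse(cur["admission"]) - _parse(prev["discharge"])).days
            if 0 <= gap <= max_gap_days:
                flagged.append((prev, cur))
    return flagged
```

Applied to the three rows of Tab. 1, this flags two consecutive pairs, since each readmission falls on the day of the preceding discharge.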
Tab. 2 Statistics of datasets
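| Data type | Records | Key fields |
| --- | --- | --- |
| Payment details | 63700007 | district code, person ID, payment year and month, annual salary, contribution-base salary, payment amount, employer contribution, individual contribution |
| Personal information | 429421 | district code, person ID, employed/retired status, employer type, date of birth, sex |
| Inpatient records | 722883 | district code, person ID, visit serial number, medical institution code, hospital level, total cost, class A cost, class B cost, non-basic cost, drug cost, deductible, reimbursement ratio, personal account payment, pooled fund payment, out-of-pocket payment, admission date, discharge date, admission diagnosis name, discharge diagnosis name |
| Outpatient records | 1255139 | district code, person ID, visit serial number, medical institution code, total cost, class A cost, class B cost, non-basic cost, deductible, reimbursement ratio, personal account payment, pooled fund payment, supplementary insurance payment, out-of-pocket payment, settlement time |
| Inpatient details | 167254167 | visit serial number, three-catalog ID, drug/treatment name, unit price, quantity, price limit, total cost |
| Outpatient details | 2952968 | visit serial number, three-catalog ID, drug/treatment name, unit price, quantity, price limit, total cost |
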
Tab. 3 Comparison of detection rates
| Data volume (records) | CBLOF | Improved CBLOF |
| --- | --- | --- |
| 1% (4) | 2 (5%) | 4 (10%) |
| 2% (10) | 6 (15%) | 9 (23%) |
| 4% (19) | 15 (38%) | 19 (48%) |
| 6% (29) | 22 (56%) | 23 (59%) |
| 8% (38) | 28 (71%) | 29 (74%) |
| 10% (48) | 35 (89.7%) | 35 (89.7%) |
| 12% (57) | 37 (94.8%) | 37 (94.8%) |
| 14% (67) | 38 (97.4%) | 39 (100%) |
| 15% (72) | 39 (100%) | 39 (100%) |

Tab. 4 Results of frequent hospitalization and hospitalization decomposition detection
| Detection type | Records | Proportion |
| --- | --- | --- |
| Frequent hospitalization | 675 | 0.10% |
| Hospitalization decomposition | 621 | 0.15% |

Tab. 5 Results of hospitalization decomposition detection
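| Medical institution code | Hospital level | Occurrences | Proportion |
| --- | --- | --- | --- |
| 8C78D9FC2943A | Level 2 | 1287 | 30.19% |
| 018FF7841008E | Level 3 | 1261 | 29.5% |
| 9E9A032EDE596 | Level 2 | 226 | 5.3% |
| D30AFB8296CF7 | Level 3 | 217 | 5.2% |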