A public opinion analysis model based on Danmu data monitoring and sentiment classification
Abstract: With the rapid development of online video platforms, Danmu (real-time overlaid comments) has gradually become an important way for people to express their opinions, and it is particularly popular among young people. Unlike conventional texts, Danmu texts are generally short and casually worded, contain a great deal of Internet slang, and even use some conventional stop words to express emotion. This paper proposes a public opinion analysis model based on Danmu data. Tailored to the way Danmu data are generated and stored, a hotspot-detection loop-adaptive algorithm is proposed for Danmu data collection. A sentiment dictionary is then expanded with the Internet slang common in Danmu, so that emotionally charged Danmu can be separated from neutral ones. Finally, a sentiment polarity classification model based on a convolutional neural network (CNN) is built to distinguish positive from negative Danmu, and the public opinion analysis result is derived from its output. Experiments show that the proposed model can effectively characterize public opinion in news-related Danmu data.
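As a minimal illustration of the dictionary-based step described above, emotionally charged Danmu can be separated from neutral ones by matching against an expanded lexicon. The lexicon entries and helper below are hypothetical examples (drawn from the slang style shown in Table 1), not the paper's actual expanded dictionary:

```python
# Sketch of lexicon-based filtering: a Danmu containing any entry of an
# expanded sentiment lexicon (including Internet slang) is treated as
# emotionally charged; all others are treated as neutral.
SENTIMENT_LEXICON = {
    "超厉害": "positive",   # "awesome" (Internet slang)
    "哈哈": "positive",     # laughter
    "衰": "negative",       # "unlucky" (slang)
    "抓狂": "negative",     # "going crazy"
}

def filter_emotional(danmu_list):
    """Split a list of Danmu texts into emotional and neutral subsets."""
    emotional, neutral = [], []
    for text in danmu_list:
        if any(word in text for word in SENTIMENT_LEXICON):
            emotional.append(text)
        else:
            neutral.append(text)
    return emotional, neutral
```

Only the emotional subset is then passed to the CNN polarity classifier; the neutral subset is excluded from the polarity statistics.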
Key words:
- Danmu emotion
- Internet sentiment
- emotion classification
- deep learning
- web crawler
Algorithm 1: Hotspot-detection loop-adaptive data acquisition algorithm
FUNCTION hotspot_cycle()
Input: none; Output: none
1:  WHILE true DO
2:      x = 1                                      // initialize the acquisition count to 1
3:      span_time = span(x)                        // initialize the acquisition interval
4:      WHILE true DO
5:          sleep(span_time)                       // wait for span_time
6:          danmus = spider_get_danmus()           // fetch all Danmu currently in the pool
7:          pool_size = spider_get_pool_size()     // get the Danmu pool size n
8:          num_of_danmu = compare_and_store(danmus)   // deduplicate and count the new Danmu
9:          IF num_of_danmu > ∂(n, span_time) OR x > 20 THEN   // when the number of new Danmu exceeds the threshold, or the loop has run more than 20 times, exit the inner loop and return to the outer loop
10:             break
11:         END
12:         x = x + 1
13:         span_time = span(x)
14:     END
15: END

Algorithm 2: Crawl-interval function
FUNCTION span(x)
Input: crawl count; Output: interval between crawls
1:  IF x ≤ 6 THEN
2:      return 2^(x−1)
3:  END
4:  return span(6) + 2(x − 6)                      // recursive call

Algorithm 3: Threshold function
FUNCTION ∂(n, span)
Input: Danmu pool size, current crawl interval; Output: Danmu growth threshold
1:  return n × span / 60
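Algorithms 2 and 3 translate directly into Python. The sketch below mirrors the pseudocode (the spider calls of Algorithm 1 are omitted):

```python
def span(x):
    """Crawl-interval function (Algorithm 2): the interval doubles over the
    first six crawls (1, 2, 4, 8, 16, 32), then grows linearly by 2 per crawl."""
    if x <= 6:
        return 2 ** (x - 1)
    return span(6) + 2 * (x - 6)  # recursive call, as in the pseudocode

def growth_threshold(n, span_time):
    """Threshold function (Algorithm 3): the expected Danmu growth scales with
    the pool size n and the current interval, normalized by 60."""
    return n * span_time / 60
```

With these definitions, the inner loop of Algorithm 1 exits once the number of newly observed Danmu exceeds `growth_threshold(pool_size, span_time)` (a hotspot is detected) or the crawl count passes 20, after which the interval resets and the cycle restarts.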
Tab. 1 Weibo comment tagged data
| Label    | Comment |
|----------|---------|
| Positive | 鼓掌 超厉害 ("applause, awesome") |
| Positive | 更多选择, 更多欢笑 嘻嘻 ("more choices, more laughter, hee-hee") |
| Positive | 这个可以有~~哈哈 ("this works~~ haha") |
| Negative | 挤公交挤死 衰 ("crushed on the bus, ugh") |
| Negative | 呵呵, 头像撞了 抓狂 ("heh, someone took my avatar, going crazy") |
| Negative | 去你妈得周末 怒 ("to hell with this weekend, furious") |
Tab. 2 Three sets of random Danmu data
| Group   | Total texts | Positive | Negative |
|---------|-------------|----------|----------|
| Group 1 | 300         | 150      | 150      |
| Group 2 | 300         | 150      | 150      |
| Group 3 | 300         | 150      | 150      |
Tab. 3 Evaluation of a Weibo data classification model
| Model | Accuracy/% | Precision/% | Recall/% | F1-score/% |
|-------|------------|-------------|----------|------------|
| NB    | 79.4       | 76.3        | 72.8     | 74.5       |
| SVM   | 84.3       | 85.3        | 83.6     | 84.4       |
| CNN   | 89.7       | 89.8        | 89.6     | 89.7       |
Tab. 4 Evaluation of different classification models in Danmu sentiment classification
| Model | Group   | Accuracy/% | Precision/% | Recall/% | F1-score/% |
|-------|---------|------------|-------------|----------|------------|
| NB    | Group 1 | 62.3       | 66.7        | 61.3     | 63.9       |
| NB    | Group 2 | 68.7       | 71.3        | 67.7     | 69.5       |
| NB    | Group 3 | 59.6       | 64.0        | 58.9     | 61.3       |
| NB    | Average | 63.5       | 67.3        | 62.6     | 64.9       |
| SVM   | Group 1 | 75.3       | 80.0        | 73.2     | 76.4       |
| SVM   | Group 2 | 74.7       | 76.0        | 74.0     | 75.0       |
| SVM   | Group 3 | 72.7       | 75.4        | 71.5     | 73.4       |
| SVM   | Average | 74.2       | 77.1        | 72.9     | 74.9       |
| CNN   | Group 1 | 82.3       | 84.7        | 81.4     | 83.0       |
| CNN   | Group 2 | 87.7       | 89.3        | 86.5     | 87.9       |
| CNN   | Group 3 | 84.3       | 81.3        | 86.5     | 83.8       |
| CNN   | Average | 84.8       | 85.5        | 84.8     | 84.9       |
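The F1-scores reported in Tables 3 and 4 follow from precision P and recall R in the usual way, F1 = 2PR/(P + R), and the per-model averages in Table 4 are the means over the three groups. A quick consistency check, using only figures taken from the tables:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, as reported in Tables 3 and 4."""
    return 2 * precision * recall / (precision + recall)

# CNN row of Table 3: precision 89.8 %, recall 89.6 % reproduce the listed F1.
cnn_f1 = round(f1_score(89.8, 89.6), 1)

# CNN average accuracy in Table 4: mean of the three per-group accuracies.
cnn_avg_accuracy = round((82.3 + 87.7 + 84.3) / 3, 1)
```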