Improved LDA model for microblog topic mining
Abstract: As the number of Sina microblog users keeps growing, microblog websites have become a platform where many people get information. Microblog posts, however, are a special kind of text whose length is strictly limited, so traditional topic models cannot analyze their content well. This paper proposes RT-LDA, an LDA-based generative model for microblogs, to address the length restriction. The model is inferred with Gibbs sampling; it can not only mine the topics of each microblog accurately but also summarize the distribution of the topics that users follow. Experiments on a real data set show that RT-LDA mines microblog topics effectively.
Key words:
- Sina microblog
- text mining
- RT-LDA
- Gibbs sampling