Efficient implementation for LDA in Mahout
Abstract: Based on a careful study of Latent Dirichlet Allocation (LDA) with Gibbs sampling and of the MapReduce framework, a distributed parallel implementation of LDA in Mahout was achieved. The computational performance of this distributed parallel program was examined in detail; the experiments showed high performance, and several key issues affecting performance were discussed.
Keywords:
- Latent Dirichlet Allocation
- Gibbs sampling
- Mahout
- distributed parallel computing
- MapReduce framework
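As a point of orientation for the Gibbs sampling step named in the abstract, the following is a minimal single-machine sketch of collapsed Gibbs sampling for LDA. It is not the Mahout implementation described in the paper: the function name `gibbs_lda`, the hyperparameter defaults `alpha` and `beta`, and the plain-list count tables are all assumptions made for illustration. In a MapReduce setting one plausible arrangement, assumed here rather than taken from the paper, is for each mapper to run the inner loop over its own shard of documents against a copy of the topic-word counts, with a reducer merging the count changes after every pass.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch, not Mahout code).

    docs is a list of documents, each a list of integer word ids in
    range(vocab_size). Returns the document-topic and topic-word count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n_dk
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_kw
    topic_total = [0] * num_topics                              # n_k
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Unnormalised conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]

                # Draw a new topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word
```

Each token is removed from the counts before its topic is resampled and added back afterwards, so the row sums of `doc_topic` always equal the document lengths. This invariant is a convenient sanity check when the sampler is split across workers.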