Algorithm for video click-through rate prediction
Abstract: Click-through rate (CTR) prediction plays an important role in video recommendation systems. A recommendation system can reorder the videos it presents according to the predicted click-through rates, which raises the actual click-through rate of its users. Because the data are massive and highly imbalanced, however, the accuracy of click-through rate prediction is generally low. To address these problems, this paper combines feature engineering with machine learning to improve the performance of existing video click-through rate prediction algorithms. In the first stage, feature engineering extracts user and video features from the raw data and generates cross features using methods such as matrix factorization. In the second stage, click-through rate prediction models are built on logistic regression (LR), factorization machines (FM), and gradient boosting decision trees combined with logistic regression (GBDT+LR). The experimental results show that the FM and GBDT+LR models predict more accurately than the LR model, and that crossing user features with video features improves the accuracy of click-through rate prediction.
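The abstract names the factorization machine as one of the prediction models but gives no formula. As a rough illustration only, the sketch below scores one feature vector with the standard second-order factorization machine and maps the score to a click probability with a sigmoid. The function names, the toy weights, and the choice of two latent factors are assumptions made for this example; they are not taken from the paper.

```python
import math


def fm_score(x, w0, w, V):
    """Second-order factorization machine score for one feature vector.

    x  : list of n feature values
    w0 : global bias
    w  : list of n linear weights
    V  : n x k list of latent factor rows (one row per feature)
    """
    n = len(x)
    k = len(V[0])
    linear = sum(wi * xi for wi, xi in zip(w, x))
    # Pairwise term sum_{i<j} <v_i, v_j> x_i x_j, computed in O(n * k)
    # with the identity 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2].
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


def click_probability(x, w0, w, V):
    """Map the FM score to a probability with the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-fm_score(x, w0, w, V)))


# Toy example (hypothetical weights): features 0 and 2 are active.
x = [1.0, 0.0, 1.0]
w0 = -2.0
w = [0.5, 0.1, -0.3]
V = [[0.1, 0.2], [0.0, 0.3], [0.4, -0.1]]
p = click_probability(x, w0, w, V)
```

The pairwise term is what lets the model learn interactions between user and video features without enumerating every pair explicitly, which is the kind of feature crossing the abstract refers to.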
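Tab. 1 Number of features

| Feature group | Feature category | Number of features |
| --- | --- | --- |
| User features | User attributes | 4 |
| User features | User interest preferences | 460 |
| User features | User device attributes | 28 |
| Video features | Video type features | 99 |
| Video features | Basic video information | 278 |
| Combined features | Cross features | 8 |
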
Tab. 2 User and video features
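
| Feature group | Feature category | Feature content |
| --- | --- | --- |
| User features | User attributes | Gender, age, education |
| User features | User interest preferences | Interest tags, video type, director, actor, region |
| User features | User device attributes | Number of CPU cores, memory, default browser, coexisting security software |
| Video features | Video type features | Main category, subcategory |
| Video features | Basic video information | Actors, director, release year, region |
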
Tab. 3 Tags for actors a user is interested in

| Actor tags |
| --- |
| 海清%27#陈数%24#邬君梅%38#王源%54#易烊千玺%18#胡先煦%41#汪俊%34 |

Tab. 4 Encoding of the tags for actors a user is interested in

| 唐嫣 | 吴建豪 | … | 王源 | … | 易烊千玺 | … |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 54 | 0 | 18 | 0 |

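The step from the tag string in Tab. 3 to the vector in Tab. 4 can be sketched as follows. This is a minimal illustration that assumes each tag has the form name%weight, that tags are separated by #, and that actors absent from the user's tags are encoded as 0. The function names and the four-actor vocabulary are hypothetical; the vocabulary contains only the names visible in Tab. 4, and the elided columns are left out.

```python
def parse_tags(tag_string):
    """Parse 'name%weight#name%weight' into a {name: weight} dict."""
    weights = {}
    for item in tag_string.split("#"):
        if not item:
            continue  # skip empty pieces from stray separators
        name, _, weight = item.partition("%")
        weights[name] = int(weight)
    return weights


def encode(tag_string, vocabulary):
    """Return one weight per vocabulary entry, 0 when the actor is absent."""
    weights = parse_tags(tag_string)
    return [weights.get(name, 0) for name in vocabulary]


# Tag string from Tab. 3; the vocabulary is an assumed subset of actors.
tags = "海清%27#陈数%24#邬君梅%38#王源%54#易烊千玺%18#胡先煦%41#汪俊%34"
vocabulary = ["唐嫣", "吴建豪", "王源", "易烊千玺"]
vector = encode(tags, vocabulary)
print(vector)  # [0, 0, 54, 18]
```

Under these assumptions the output reproduces the visible entries of Tab. 4: the two actors missing from the user's tags get 0, while 王源 and 易烊千玺 keep their weights 54 and 18.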
Tab. 5 Result 1
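
| Model / algorithm | Precision | Recall | F1-Score | Log loss | AUC | Time (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| FM | 0.13 | 0.64 | 0.22 | 0.1304 | 0.862 | 57968 |
| GBDT+LR | 0.09 | 0.74 | 0.16 | 0.1347 | 0.883 | 4231 |
| LR | 0.08 | 0.14 | 0.10 | 0.2042 | 0.682 | 409 |
| NB | 0.01 | 0.21 | 0.01 | 2.4417 | 0.626 | 1624 |
| DT | 0.01 | 0.77 | 0.01 | 15.121 | 0.666 | 3471 |
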
Tab. 6 Result 2
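
| Model / algorithm | Precision | Recall | F1-Score | Log loss | AUC | Time (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| FM | 0.10 | 0.81 | 0.18 | 0.1020 | 0.941 | 393332 |
| GBDT+LR | 0.09 | 0.68 | 0.16 | 0.1666 | 0.861 | 275293 |
| LR | 0.02 | 0.35 | 0.04 | 0.2158 | 0.803 | 2628 |
| NB | 0.001 | 0.19 | 0 | 1.4455 | 0.654 | 5009 |
| DT | 0.001 | 0.77 | 0 | 15.13 | 0.668 | 33295 |
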
Tab. 7 Result 3

| Model | Precision | Recall | F1-Score | Log loss | AUC | Time (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| FM | 0.16 | 0.90 | 0.28 | 0.1623 | 0.979 | 53065 |
| GBDT+LR | 0.14 | 0.94 | 0.24 | 0.1397 | 0.987 | 4154 |
| LR | 0.10 | 0.93 | 0.19 | 0.2384 | 0.971 | 365 |

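The result tables report F1-Score and log loss without stating how they are computed. Assuming the usual definitions (F1 as the harmonic mean of precision and recall, log loss as the mean binary cross-entropy), the sketch below shows both. The function names and the clipping constant are choices made for this example. Because the tabulated precision and recall are themselves rounded, a recomputed F1 can differ from the table in the last digit.

```python
import math


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def log_loss(labels, probabilities, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


# Recompute F1 for three rows of Tab. 5 from their precision and recall.
f1_fm = round(f1_score(0.13, 0.64), 2)       # FM row
f1_gbdt_lr = round(f1_score(0.09, 0.74), 2)  # GBDT+LR row
f1_lr = round(f1_score(0.08, 0.14), 2)       # LR row
```

With these definitions the three recomputed values agree with the F1-Score column of Tab. 5 (0.22, 0.16, and 0.10). The low precision next to the comparatively high recall in every row is consistent with the class imbalance noted in the abstract.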
Tab. 8 Accuracy of recommendations

| Number of video clicks | Recommendation success rate |
| --- | --- |
| > 20 | 0.03 |
| > 30 | 0.022 |
| > 40 | 0.07 |
| > 60 | 0.08 |

Tab. 9 Accuracy of movie and TV recommendations
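
| Number of video clicks | Success rate (movies) | Success rate (TV series) |
| --- | --- | --- |
| > 20 | 0.02 | 0.05 |
| > 30 | 0.013 | 0.018 |
| > 40 | 0.07 | 0.08 |
| > 60 | 0.086 | 0.085 |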