Research on an advertising click-through rate prediction model based on feature optimization
-
Abstract: Internet advertising data is high-dimensional and sparse. Building on existing theory and techniques for click-through rate (CTR) prediction, this paper proposes CNN+ (CNN Based on GBDT), an online advertising feature extraction model that combines a gradient boosting decision tree (GBDT) with a convolutional neural network (CNN). CNN+ extracts deep, high-order features from raw data and addresses the difficulty that CNNs have in extracting features from sparse, high-dimensional inputs. Experimental results on real datasets show that the features extracted by CNN+ are more effective than those produced by two other feature extraction methods, principal component analysis (PCA) and GBDT.
-
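Tab. 1 Data category distribution

| Data category | Field name | Field label |
| --- | --- | --- |
| Basic data | Whether the user clicked the ad | click |
| Ad features | Ad ID | adid |
| Ad features | Advertiser ID | advert_id |
| Ad features | Order ID | orderid |
| Ad features | Advertiser industry | advert_industry_inner |
| Ad features | Advertiser name | advert_name |
| Ad features | Campaign ID | campaign_id |
| Ad features | Creative ID | creative_id |
| Ad features | Creative type | creative_type |
| Ad features | Style targeting ID | creative_tp_dnf |
| Ad features | Whether the creative has a deep (multi-level) link | creative_has_deeplink |
| Ad features | Whether the landing page redirects | creative_is_jump |
| Ad features | Whether the landing page is a download | creative_is_download |
| Ad features | Whether the creative is JavaScript | creative_is_js |
| Ad features | Whether the ad is a voice ad | creative_is_voicead |
| Ad features | Ad width | creative_width |
| Ad features | Ad height | creative_height |
| Media information | App category | app_cate_id |
| Media information | First-level channel | f_channel |
| Media information | Media ID | app_id |
| Media information | Media ad slot | inner_slot_id |
| Media information | Whether the app is paid | app_paid |
| Context information | User's city | city |
| Context information | Carrier | carrier |
| Context information | Timestamp | time |
| Context information | Province | province |
| Context information | Network connection type | nnt |
| Context information | Device type | devtype |
| Context information | Operating system version | osv |
| Context information | Operating system | os |
| Context information | Phone brand | make |
| Context information | Phone model | model |
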
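Most of the fields in Tab. 1 are categorical identifiers, which is where the high-dimensional sparsity mentioned in the abstract comes from. The following is a minimal sketch of one-hot encoding over a few of those field labels. The record values, the `build_vocab` and `one_hot` helpers, and the choice of one-hot encoding itself are illustrative assumptions, not the paper's preprocessing.

```python
# Hypothetical records using a few field labels from Tab. 1; all values are invented.
records = [
    {"adid": "a1", "app_id": "m7", "city": "c021", "os": "android"},
    {"adid": "a2", "app_id": "m7", "city": "c010", "os": "ios"},
    {"adid": "a1", "app_id": "m9", "city": "c021", "os": "android"},
]


def build_vocab(rows):
    """Assign one column index to every distinct (field, value) pair."""
    vocab = {}
    for row in rows:
        for field, value in sorted(row.items()):
            vocab.setdefault((field, value), len(vocab))
    return vocab


def one_hot(row, vocab):
    """Return the sorted indices of the non-zero columns for one record."""
    return sorted(vocab[(f, v)] for f, v in row.items() if (f, v) in vocab)


vocab = build_vocab(records)
encoded = [one_hot(r, vocab) for r in records]

# Every record activates exactly one column per field, so the fraction of
# non-zero entries shrinks as the number of distinct values grows.
dimension = len(vocab)
density = sum(len(e) for e in encoded) / (len(records) * dimension)
print(dimension, encoded, density)
```

With real identifier fields such as `adid` or `model`, the number of distinct values grows far faster than the number of fields, so the encoded dimension becomes large while each row keeps only one non-zero entry per field.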
Tab. 2 Parameters for the CNN model
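
| Parameter | Value |
| --- | --- |
| Number of convolutional layers | 4 |
| Convolution kernel size | (4,4) |
| Pooling size | (2,2) |
| Dropout | 0.1 |
| Optimizer | Adam |
| Neurons in the fully connected layer | 128 |
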
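To see what the kernel and pooling sizes in Tab. 2 imply for feature-map dimensions, the sketch below traces the spatial shape through four convolution stages. Tab. 2 does not state the input size, the padding mode, the stride, or whether every convolutional layer is followed by pooling, so all of these are assumptions here, and `conv_out` and `trace_shapes` are hypothetical helper names.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one axis for a convolution or pooling window."""
    if size + 2 * padding < kernel:
        raise ValueError("input is smaller than the window")
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(height, width, layers=4, kernel=4, pool=2, pad_same=True):
    """List the feature-map shape after each conv + pool stage.

    Assumptions (not given in Tab. 2): stride-1 convolutions, one 2x2
    pooling step after every convolution, pooling stride equal to its size.
    """
    shapes = [(height, width)]
    for _ in range(layers):
        if not pad_same:
            # "valid" convolution shrinks each axis by kernel - 1
            height = conv_out(height, kernel)
            width = conv_out(width, kernel)
        # with "same" padding the convolution keeps the size unchanged
        height = conv_out(height, pool, stride=pool)
        width = conv_out(width, pool, stride=pool)
        shapes.append((height, width))
    return shapes


same_shapes = trace_shapes(32, 32)                    # assumed 32x32 input
valid_shapes = trace_shapes(64, 64, pad_same=False)   # assumed 64x64 input
print(same_shapes)
print(valid_shapes)
```

Under these assumptions, four pooling steps divide each axis by 16, so the input grid has to be at least 16 cells wide with "same" padding and considerably larger with "valid" padding.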
Tab. 3 Parameters for different models
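
| Model | Parameters |
| --- | --- |
| PCA-LR | Number of principal components: 26 |
| GBDT-LR | GBDT subtrees: 21 |
| CNN+-LR | GBDT subtrees: 21; CNN parameters: as listed in Tab. 2 |
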
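A common way to use GBDT as a feature extractor ahead of logistic regression is to record which leaf each sample reaches in every tree and one-hot encode those leaf indices. The sketch below illustrates that encoding with the 21 subtrees from Tab. 3. The number of leaves per tree, the sample leaf indices, and the `encode_leaves` helper are assumptions, and the source does not confirm that the paper's models use exactly this encoding.

```python
N_TREES = 21          # number of GBDT subtrees, from Tab. 3
LEAVES_PER_TREE = 8   # assumption: the source does not give the tree size


def encode_leaves(leaf_ids, leaves_per_tree=LEAVES_PER_TREE):
    """One-hot encode the leaf reached in each tree into a single vector."""
    vec = [0] * (len(leaf_ids) * leaves_per_tree)
    for tree, leaf in enumerate(leaf_ids):
        if not 0 <= leaf < leaves_per_tree:
            raise ValueError(f"leaf index {leaf} out of range for tree {tree}")
        # each tree owns a contiguous block of leaves_per_tree columns
        vec[tree * leaves_per_tree + leaf] = 1
    return vec


# Hypothetical leaf indices for one sample, one entry per tree.
sample_leaves = [i % LEAVES_PER_TREE for i in range(N_TREES)]
features = encode_leaves(sample_leaves)
print(len(features), sum(features))
```

Each sample then has exactly one active column per tree, so the transformed vector is much lower-dimensional than the raw one-hot input and can be passed to a downstream model such as logistic regression.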