基于隐变量模型的多维用户偏好建模

王珊蕾; 岳昆; 武浩; 田凯琳

doi:10.3969/j.issn.1000-5641.2017.05.013

基于隐变量模型的多维用户偏好建模

doi: 10.3969/j.issn.1000-5641.2017.05.013

王珊蕾^1,,
岳昆^1, ,,
武浩¹,
田凯琳²

1.
云南大学信息学院, 昆明 650500
2.
西南林业大学图书馆, 昆明, 650224

基金项目:

国家自然科学基金 61472345

国家自然科学基金 61562090

教育部"云数融合、科教创新"基金 2017B000

第二批"云岭学者"培养项目 C6153001

云南省应用基础研究计划重点项目 2014FA023

云南大学青年英才培育计划 WX173602

中国博士后科研基金 2016M592721

详细信息

作者简介:
王珊蕾, 男, 硕士研究生, 研究方向为海量数据分析与知识发现.E-mail:407773704@qq.com

通讯作者:
岳昆, 男, 博士, 教授, 博士生导师, 研究方向为海量数据分析与知识发现.E-mail:kyue@ynu.edu.cn

中图分类号: TP311
计量
- 文章访问数: 215
- HTML全文浏览量: 122
- PDF下载量: 263
- 被引次数: 0
出版历程
- 收稿日期: 2017-05-01
- 刊出日期: 2017-09-25

Modeling multi-dimensional user preference based on the latent variable model

1.
School of Information Science and Engineering, Yunnan University, Kunming 650500, China
2.
Library of South West Forestry University, Kunming 650224, China

摘要

摘要: 从用户行为数据构建用户偏好模型，是解决个性化服务、评分预测和用户行为定向等问题的重要基础.本文从用户的评分数据出发，以多个隐变量分别描述用户在评分对象多个维度的偏好，以含有多个隐变量的贝叶斯网（简称隐变量模型）作为表示用户偏好的基本知识框架.首先根据用户偏好和隐变量的特定含义给出模型构建的约束条件，进而提出基于约束条件的模型构建方法，使用约束条件下的EM算法来计算模型参数，约束条件下的SEM算法来构建模型结构.针对多隐变量情形下模型构建过程中产生大量中间数据带来的计算复杂度急剧上升的问题，本文使用Spark计算框架实现模型构建的方法.建立在Movielens数据集上的实验表明，本文提出的方法是有效的.
- 评分数据 /
- 多维偏好 /
- 隐变量 /
- 贝叶斯网 /
- Spark
Abstract: Modeling user preference from user behavior data is the basis of personalization service, score prediction, user behavior targeting, etc. In this paper, multi-dimensional preferences from rating data are described by multiple latent variables and the Bayesian network with multiple latent variables is adopted as the preliminary knowledge framework of user preference. Constraint conditions are given according to the inherence of user preference and latent variables, upon which we propose a method for modeling user preference. Parameters are computed by EM algorithm and structure is established by SEM algorithm with respect to the given constraints. In the case of multiple latent variables, a large amount of intermediate data is generated in modeling, which causes the increasing computational complexity. Therefore, we implement the modeling method with Spark computing framework. Experiments results on the Movielens dataset verify that the method proposed in this paper is effective.
- rating data /
- multi-dimensional preference /
- latent variable /
- Bayesian network /
- Spark

HTML全文

图 1 结构约束

Fig. 1 Structural constraint

下载: 全尺寸图片幻灯片

图 2 算法1的流程

Fig. 2 Process of Algorithm

下载: 全尺寸图片幻灯片

图 3 当前结构和部分参数

Fig. 3 Current structure and partial parameters

下载: 全尺寸图片幻灯片

图 4 参数集

Fig. 4 Set of parameters

下载: 全尺寸图片幻灯片

图 5 当前结构和参数

Fig. 5 Current structure and parameters

下载: 全尺寸图片幻灯片

图 6 候选结构

Fig. 6 Candidate structure

下载: 全尺寸图片幻灯片

图 7 算法1的执行时间

Fig. 7 Execution time of Algorithm 1

下载: 全尺寸图片幻灯片

图 8 算法1的加速比

Fig. 8 Speedup ratio of Algorithm 1

下载: 全尺寸图片幻灯片

图 9 隐变量个数增加时算法1的执行时间

Fig. 9 Execution time of Algorithm 1 with the increase of latent variables

下载: 全尺寸图片幻灯片

图 10 隐变量父节点数增加时算法1的执行时间

Fig. 10 Execution time of Algorithm 1 with the increase of parent nodes of the latent variable

下载: 全尺寸图片幻灯片

图 11 算法2的执行时间

Fig. 11 Execution time of Algorithm 2

下载: 全尺寸图片幻灯片

图 12 算法2的加速比

Fig. 12 Speedup ratio of Algorithm 2

下载: 全尺寸图片幻灯片

图 13 执行时间对比

Fig. 13 Comparison of execution time

下载: 全尺寸图片幻灯片

图 14 准确性

Fig. 14 Precision

下载: 全尺寸图片幻灯片

图 15 覆盖率

Fig. 15 Coverage

下载: 全尺寸图片幻灯片

图 16 F值

Fig. 16 F score

下载: 全尺寸图片幻灯片

图 17 准确性对比

Fig. 17 Comparison of precision

下载: 全尺寸图片幻灯片

图 18 覆盖率对比

Fig. 18 Comparison of coverage

下载: 全尺寸图片幻灯片

图 19 F值对比

Fig. 19 Comparison of F scores

下载: 全尺寸图片幻灯片

算法1: CPT-Learn( $\varphi^0$ , $D$ , $\theta^0$ , $\partial$ , $T$ )

输入: $\varphi^0$ :当前模型结构

$D$ :当前数据集

$\theta^0$ :当前参数(初次迭代则为空)

$\partial$ :相似度阈值

$T$ :迭代次数

输出: $D'$ :完整数据集

$\theta'$ :模型参数

1: $\varphi\leftarrow \varphi^0$ //初始化结构

2: if $\theta^0$ 为空then

3:随机产生一组满足约束2的初始参数 $\theta'$

4: $\theta^0\leftarrow \theta'$ //将初始参数作为当前参数

5: end if

6: for $t\leftarrow0$ to $T$ do

7: pair $\leftarrow D$ .map{Line= $>$ //对每一行数据操作

key $\leftarrow$ 填充该行数据的缺值

value $\leftarrow$ 根据公式(2) 计算当前填充组合的概率

Emit(key, value)

}

8:Rpair $\leftarrow$ pair.flatMapValues{Line= $>$ //对pair中的每个键值对操作

for $i\leftarrow$ 0 to key.length do //遍历key中的每一个值

key $\leftarrow(V_i=v_i,\pi(V_i)=j)$

// $V_i$ 是key中第 $i$ 个值 $v_i$ 对应的节点编号, $\pi(V_i)$ 是节点 $V_i$ 的父节点集合

Emit(key, value)

end for

}

9: $m^{i}_{ijk}\leftarrow$ Rpair.reducebykey() //按key求和

10: $\theta^{t+1}\leftarrow$ 根据公式(4) 计算参数

11: if $S(\theta^{t},\theta^{t+1})＜\partial$ then //根据公式(5) 计算 $S$

12: return $\theta^{t+1}$

13: end if

14: end for

下载: 导出CSV

表 1 数据集片断

Tab. 1 Fragment of data set

样本ID	$U_1$	$L_1$	$L_2$	$I_1$	$I_2$	$R$
0	0			0	1	1
1	1			0	1	0
$\cdots$	$\cdots$			$\cdots$	$\cdots$	$\cdots$

下载: 导出CSV

表 2 填充后数据

Tab. 2 Filled data

样本ID	$U_1$	$L_1$	$L_2$	$I_1$	$I_2$	$R$
0-1	0	0	0	0	0	1
0-2	0	1	0	0	1	1
0-3	0	0	1	0	1	1
0-4	0	1	1	0	1	1

下载: 导出CSV

算法2: DAG-Learn( $\varphi^0$ , $D$ , $\theta^0$ , $\partial$ , $T$ )

输入: $\varphi^0$ :当前结构

$\theta^0$ :当前参数

$D$ :数据集( $\vert D\vert =N)$

$\partial$ :参数相似度阈值

$T$ :迭代次数

V_num:节点个数

输出: ( $\varphi, \theta$ ):隐变量模型

1: if $\varphi^0$ 和 $\theta^0$ 为空then

2:随机产生满足约束1的BN结构 $\varphi'$ ; 随机产生满足约束2的初始参数 $\theta'$

3: $(\theta^0,\varphi^0)\leftarrow(\theta',\varphi')$ //将初始结构和参数作为当前结构和当前参数

4: end if

5: $(\varphi,\theta)\leftarrow (\varphi',\theta')$

6: $D^0\leftarrow$ CPT-Learn( $\varphi^0$ , $D$ , $\theta^0$ , $\partial$ , 1) //调用算法1填充数据

7: oldScore $\leftarrow $ BICe( $\varphi$ , $\theta|D^0$ )//计算当前BIC分数

8: newScore $\leftarrow -\infty$

9: for $i\leftarrow$ 0 to $V_{num}-1$ do

10: $c\_{set}\leftarrow$ 对当前结点加、减或转边得到一系列候选结构

11: Rpair $\leftarrow c\_{set}$ .map{Line= $>$ //对 $c\_{set}$ 中每一个候选结构进行操作

$\theta^i$ , $D^I\leftarrow$ CPT-Learn( $\varphi^i$ , $D^i$ , $\theta^i$ , $\partial$ , 1) //一次参数计算

key $\leftarrow (\varphi^i, \theta^i)$

value $\leftarrow$ BICe( $\varphi^i$ , $\theta^i|D^i$ ) //计算候选结构BIC分数

Emit(key, value)

}

12: Rpair.foreach{Line= $>$ //挑选BIC分数最大的候选模型

if value $>$ newScore then

$(\varphi^i, \theta^i)\leftarrow$ key

newScore $\leftarrow$ value

end if

}

13: if newScore $>$ oldScore then

14: ( $\theta^{i+1}$ , $D^{i+1})\leftarrow$ CPT-Learn( $\varphi^i$ , $D^i$ , $\theta^i$ , $\partial$ , $T$ ) // $T$ 次参数计算

15: ( $\varphi,\theta)\leftarrow(\varphi^{i+1},\theta^{i+1})$

16: oldScore $\leftarrow$ BICe( $\varphi$ , $\theta|D^{i+1}$ ) //更新BIC分数

17: else

18: return $(\varphi,\theta)$

19: end if

20: end for

下载: 导出CSV

参考文献(25)

[1]	GROUPLENS RESEARCH.MovieLensDataset[EB/OL].[2017-08-20]. http://grouplens.org/datasets/movielens/.
[2]	KASSAK O, KOMPAN M, BIELIKOVA M. User preference modeling by global and individual weights for personalized recommendation[J]. Acta Polytechnica Hungarica, 2015, 12(8):27-41 http://www.uni-obuda.hu/journal/Kassak_Kompan_Bielikova_64.pdf
[3]	YIN H, CUI B, CHEN L, et al. Modeling location-based user rating profiles for personalized recommendation[J]. ACM Transactions on Knowledge Discovery from Data, 2015, 9(3):1-41. doi: 10.1145/2663356
[4]	YUAN Q, CONG G, MA Z, et al. Who, where, when and what:Discover spatio-temporal topics for Twitter users[C]//ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013:605-613.
[5]	ZHAO K, CONG G, YUAN Q, et al. SAR:A sentiment-aspect-region model for user preference analysis in geo-tagged reviews[C]//IEEE, International Conference on Data Engineering. IEEE, 2015:675-686.
[6]	JIANG B, LIANG J, SHA Y, et al. Retweetingbehavior prediction based on one-class collaborative filtering in social networks[C]//Proceedings of the 39th International ACM Special Interest Group on Information Retrievalconference on Research and Development in Information Retrieval. ACM, 2016:977-980.
[7]	张连文, 郭海鹏.贝叶斯网引论[M].北京:科学出版社, 2006.
[8]	DAPHNEKOLLER, NIRFRIEDMAN, 科勒, 等.概率图模型[M].北京:清华大学出版社, 2015.
[9]	KIM J S, JUN C H. Ranking evaluation of institutions based on a Bayesian network having a latent variable[J]. Knowledge-Based Systems, 2013, 50:87-99. doi: 10.1016/j.knosys.2013.05.010
[10]	SCHÜTZ W, SCHÄFER R. Bayesian networks for estimating the user's interests in the context of a configuration task[C]//Proceedings of the UM2001 Workshop on Machine Learning for User Modeling, 2001:13-17.
[11]	FRIEDMAN N. The Bayesian structural EM algorithm[C]//Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, 2013:129-138.
[12]	ELIDAN G, FRIEDMAN N. Learning hidden variable networks:The information bottleneck approach[J]. Journal of Machine Learning Research, 2005, 6(6):81-127. http://www.jmlr.org/papers/volume6/elidan05a/elidan05a.pdf
[13]	JIN C, ZHANG Y, BALAKRISHNAN S, et al. Local maxima in the likelihood of gaussian mixture models:Structural results and algorithmic consequences[J]. Advances in Neural Information Processing Systems, 2016, 4(1):16-24. https://scirate.com/arxiv/1609.00978
[14]	ZHAO G, QIANX, XIE X. User-service rating prediction by exploring social users' rating behaviors[J]. IEEE Transactions on Multimedia, 2016, 18(3):496-506. doi: 10.1109/TMM.2016.2515362
[15]	高全力, 高岭, 杨建锋, 等.上下文感知推荐系统中基于用户认知行为的偏好获取方法[J].计算机学报, 2015(9):1767-1776. doi: 10.11897/SP.J.1016.2015.01767
[16]	王红兵, 孙文龙, 王华兰. Web服务选择中偏好不确定问题的研究[J].计算机学报, 2013, 36(2):275-285. http://www.cnki.com.cn/Article/CJFDTOTAL-JSJX201302006.htm
[17]	史艳翠, 孟祥武, 张玉洁, 等.一种上下文移动用户偏好自适应学习方法[J].软件学报, 2012, 23(10):2533-2549. http://www.cnki.com.cn/Article/CJFDTOTAL-RJXB201210002.htm
[18]	GAO R, YUE K, WU H, et al. Modeling user preference from rating data based on the bayesian network with a latent variable[C]//Proceedings of 17th International Conference on Web-Age Information Management, 2016:3-16.
[19]	HUANG Y, BIAN L. A Bayesian network and analytic hierarchy process based personalized recommendations for tourist attractions over the Internet[J]. Expert Systems with Applications, 2009, 36(1):933-943. doi: 10.1016/j.eswa.2007.10.019
[20]	CHAPELLE O, ZHANG Y. A dynamic Bayesian network click model for web search ranking[C]//Proceedings of the 18th International Conference on World Wide Web, ACM, 2009:1-10.
[21]	HUETE J, DE CAMPOS L M, FERNANDEZ-LUNA J M, et al. Using structural content information for learning user profiles[C]//Proceedings of 30th Special Interest Group on Information Retrieval, 2007:38-45.
[22]	AUFFENBERG F, STEIN S, ROGERS A. A personalised thermal comfort model using a Bayesian network[C]//Proceedings of the 2015 International Joint Conference on Artificial Intelligence, 2015:130-139.
[23]	YUE K, FANG Q, WANG X, et al. A parallel and incremental approach for data-intensive learning of Bayesian networks[J]. IEEE Transactions on Cybernetics, 2015, 45(12):2890-2909. doi: 10.1109/TCYB.2015.2388791
[24]	TAMADA Y, IMOTO S, MIYANO S. Parallel algorithm for learning optimal Bayesian network structure[J]. Journal of Machine Learning Research, 2011, 12(7):2437-2459.
[25]	MIT. FullBNT[CP/OL].[2017-08-20]. http://www.cs.ubc.ca/murphyk/Software/BNT/FullBNT-1.0.4.Zip.

施引文献

资源附件(0)

访问统计

点击查看大图

图(19) / 表(4)

计量

文章访问数: 215
HTML全文浏览量: 122
PDF下载量: 263
被引次数: 0

姓名
邮箱
手机号码
标题
留言内容
验证码

留言板

基于隐变量模型的多维用户偏好建模

doi: 10.3969/j.issn.1000-5641.2017.05.013

作者简介:
王珊蕾, 男, 硕士研究生, 研究方向为海量数据分析与知识发现.E-mail:407773704@qq.com

通讯作者:
岳昆, 男, 博士, 教授, 博士生导师, 研究方向为海量数据分析与知识发现.E-mail:kyue@ynu.edu.cn

计量

Modeling multi-dimensional user preference based on the latent variable model

计量

目录

留言板

基于隐变量模型的多维用户偏好建模

doi: 10.3969/j.issn.1000-5641.2017.05.013

作者简介: 王珊蕾, 男, 硕士研究生, 研究方向为海量数据分析与知识发现.E-mail:407773704@qq.com

通讯作者: 岳昆, 男, 博士, 教授, 博士生导师, 研究方向为海量数据分析与知识发现.E-mail:kyue@ynu.edu.cn

计量

出版历程

Modeling multi-dimensional user preference based on the latent variable model

计量

出版历程

目录

作者简介:
王珊蕾, 男, 硕士研究生, 研究方向为海量数据分析与知识发现.E-mail:407773704@qq.com

通讯作者:
岳昆, 男, 博士, 教授, 博士生导师, 研究方向为海量数据分析与知识发现.E-mail:kyue@ynu.edu.cn