Performance prediction based on fuzzy clustering and support vector regression
Abstract: Existing performance prediction models tend to rely on too many different types of attributes, which makes the prediction methods overly complex or dependent on manual intervention. To improve the accuracy and interpretability of student performance prediction, a method that combines fuzzy clustering and support vector regression is proposed. First, fuzzy logic is introduced to compute the membership matrix, and students are clustered according to their historical grades. Then, support vector regression (SVR) is used to fit and model the grade trajectory of each cluster. Finally, the prediction is adjusted using the students' learning behavior and other related attributes. Experiments on several benchmark datasets demonstrate the effectiveness of the proposed method.
Key words:
- fuzzy clustering
- support vector regression
- prediction
- educational data mining
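Algorithm 1 Fuzzy clustering of historical grades
Input: all sample records $R=\{r_1, r_2, \cdots, r_n\}$; initial cluster centers $C=\{c_1, c_2, \cdots, c_K\}$; number of clusters $K$; fuzziness index $m\ (m>1)$; minimum threshold $\varepsilon$ that controls the iteration
Output: $K$ clusters and the fuzzy membership matrix of the samples
1: Compute the initial fuzzy membership values $u_{ij}=\frac{1}{\sum\limits_{l=1}^{K}\left(\frac{\|r_i-c_j\|}{\|r_i-c_l\|}\right)^{\frac{2}{m-1}}}\quad (3)$, where $\|\cdot\|$ denotes the Euclidean distance
2: loop
3: Compute the cluster centers $c_j=\frac{\sum\limits_{i=1}^{n}u_{ij}^{m}r_i}{\sum\limits_{i=1}^{n}u_{ij}^{m}}\quad (4)$
4: Update the fuzzy membership matrix using Eq. (3)
5: Compute the objective function $J^{(t)}=\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{K}u_{ij}^{m}\|r_i-c_j\|^{2}\quad (5)$, where $t$ is the iteration index
6: until $|J^{(t)}-J^{(t-1)}|<\varepsilon$

Algorithm 2 Final grade prediction
Input: training set $U=\{r_1, r_2, \cdots, r_n\}$; test set $T=\{r_1, r_2, \cdots, r_d\}$; hyperparameters $N$ and $\alpha$
Output: predicted value of the final grade
1: Using the fuzzy membership matrix obtained in Section 2.2, compute the SVR result for the test sample in every cluster and combine them as $temp\_score_i=\sum\limits_{j=1}^{K}u_{ij}\cdot f_j(x_i)\quad (7)$, where $K$ is the number of clusters, $u_{ij}$ is the membership of test sample $i$ in cluster $j$, and $f_j$ is the SVR model fitted on cluster $j$;
2: Compute the Euclidean distance $d_i\ (1\le i\le |c_t|)$ between the test sample and each training sample that belongs to the same cluster, where $|c_t|$ is the number of instances in the cluster of the test sample;
3: Sort the distances from step 2 in ascending order and select the first $N$ training samples, denoted $S_n$;
4: For each instance in $S_n$, use the students' learning behavior and other related attributes to compute its cosine similarity with the test sample, $Sim_t=\frac{\sum\limits_{i=1}^{m}A_i^{S}\times A_i^{T}}{\sqrt{\sum\limits_{i=1}^{m}(A_i^{S})^{2}}\times\sqrt{\sum\limits_{i=1}^{m}(A_i^{T})^{2}}}\quad (8)$, where $A_i\ (1\le i\le m)$ are the student behavior attributes, $A_i^{S}$ is the attribute of the test sample, and $A_i^{T}$ is the attribute of the $T$-th instance in $S_n$;
5: Introduce the hyperparameter $\alpha$ to control how strongly the behavior attributes influence the final predicted grade, and compute the correction $b_i=\alpha\cdot\sum\limits_{n=1}^{N}\left(Sim_n\cdot(s_n-s_i)\right)\quad (9)$, where $Sim_n$ is the similarity from Eq. (8) for the $n$-th selected training sample, $s_n$ is the final grade of that sample, and $s_i$ is the grade predicted for the $i$-th test sample from SVR and the membership values;
6: Obtain the final predicted grade $score_i=temp\_score_i+b_i\quad (10)$, where $temp\_score_i$ is the SVR result from step 1.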
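The two algorithms above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `max_iter` cap, the zero-distance clamp, and the dictionary keys `grades`, `behaviour`, and `final` are all hypothetical, and the per-cluster SVR models are represented by arbitrary callables (any fitted regressor's prediction function could stand in for them).

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def memberships(records, centers, m=2.0):
    """Eq. (3): membership u_ij of record i in cluster j."""
    expo = 2.0 / (m - 1.0)
    U = []
    for r in records:
        # Assumption: clamp distances so a record lying on a center
        # does not cause a division by zero.
        d = [max(euclid(r, c), 1e-12) for c in centers]
        U.append([1.0 / sum((dj / dl) ** expo for dl in d) for dj in d])
    return U

def fuzzy_c_means(records, centers, m=2.0, eps=1e-6, max_iter=100):
    """Algorithm 1. max_iter is an added safety cap, not part of the source."""
    n, dim, K = len(records), len(records[0]), len(centers)
    U = memberships(records, centers, m)
    prev_J = float("inf")
    for _ in range(max_iter):
        # Eq. (4): centers are membership-weighted means of the records.
        new_centers = []
        for j in range(K):
            w = [U[i][j] ** m for i in range(n)]
            total = sum(w)
            new_centers.append(
                [sum(w[i] * records[i][k] for i in range(n)) / total
                 for k in range(dim)])
        centers = new_centers
        U = memberships(records, centers, m)
        # Eq. (5): membership-weighted sum of squared distances.
        J = sum(U[i][j] ** m * euclid(records[i], centers[j]) ** 2
                for i in range(n) for j in range(K))
        if abs(J - prev_J) < eps:
            break
        prev_J = J
    return centers, U

def cosine(a, b):
    """Eq. (8): cosine similarity of two behaviour-attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def predict_final(u_test, cluster_models, grades_test, behaviour_test,
                  same_cluster_train, N=3, alpha=0.1):
    """Algorithm 2 for a single test sample.

    cluster_models: one callable per cluster (stand-ins for fitted SVR models).
    same_cluster_train: training samples from the test sample's cluster, each a
    dict with hypothetical keys 'grades', 'behaviour', and 'final'.
    """
    # Eq. (7): membership-weighted combination of the per-cluster predictions.
    temp_score = sum(u * f(grades_test) for u, f in zip(u_test, cluster_models))
    # Steps 2-3: the N training samples closest in historical grades.
    nearest = sorted(same_cluster_train,
                     key=lambda s: euclid(s["grades"], grades_test))[:N]
    # Eq. (9): similarity-weighted correction scaled by alpha.
    b = alpha * sum(
        cosine(s["behaviour"], behaviour_test) * (s["final"] - temp_score)
        for s in nearest)
    return temp_score + b  # Eq. (10)
```

In this sketch the caller is expected to pass in only the training samples that share the test sample's cluster; how that cluster is chosen (for instance, by the largest membership value) is an assumption left outside the code.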
Tab. 1 Mean squared error comparison of the five methods studied

| Dataset | FCSVR | MLR | BR | EN | SVR |
|---|---|---|---|---|---|
| UCI-Math | 0.21 | 0.30 | 0.29 | 0.62 | 0.36 |
| UCI-Portuguese | 0.17 | 0.29 | 0.30 | 1.38 | 0.62 |
| Stu-Common | 0.26 | 0.30 | 0.30 | 0.73 | 0.30 |
Tab. 2 Mean absolute error comparison of the five methods studied

| Dataset | FCSVR | MLR | BR | EN | SVR |
|---|---|---|---|---|---|
| UCI-Math | 0.32 | 0.31 | 0.30 | 0.53 | 0.37 |
| UCI-Portuguese | 0.28 | 0.31 | 0.31 | 0.87 | 0.44 |
| Stu-Common | 0.39 | 0.44 | 0.44 | 0.73 | 0.43 |
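For reference, the two error measures reported in Tab. 1 and Tab. 2 can be computed as in the short sketch below. It uses the standard definitions of mean squared error and mean absolute error. The function names are illustrative, and whether the grades were scaled or normalised before the errors were computed is not stated in this section, so no attempt is made to reproduce the tabulated values.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between true and predicted grades."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences between true and predicted grades."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Both functions take two equal-length sequences of grades; lower values indicate predictions closer to the true grades.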