Extraction of social media data based on the knowledge graph and LDA model

MA You; YUE Kun; ZHANG Zichen; WANG Xiaoyi; GUO Jianbin

doi: 10.3969/j.issn.1000-5641.2018.05.016


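MA You^1, YUE Kun^1, ZHANG Zichen^1, WANG Xiaoyi^2, GUO Jianbin^2

1. School of Information Science and Engineering, Yunnan University, Kunming 650500, China
2. School of Ethnology and Sociology, Yunnan University, Kunming 650500, China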

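Funding:
- National Natural Science Foundation of China (61472345)
- Program for Excellent Young Talents of Yunnan University (WX173602)
- Research Foundation of Yunnan University (2017YDJQ06)
- Graduate Research Innovation Foundation of Yunnan University (Y2000211)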

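About the authors:
MA You, male, master's student. Research interests: massive data processing and knowledge discovery. E-mail: 1172880152@qq.com

Corresponding author:
YUE Kun, male, professor and doctoral supervisor. Research interests: massive data processing and knowledge discovery. E-mail: kyue@ynu.edu.cn

CLC number: TP311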
Publication history
- Received: 2018-07-10
- Published: 2018-09-25


Abstract

Abstract: Social media data extraction underpins research and applications in public opinion gathering, news dissemination, corporate brand promotion, and commercial marketing, and accurate extraction results are an important guarantee of the effectiveness of subsequent data analysis. Targeting the unstructured, multi-topic nature of social media data, this paper mines the latent topics in the data with the LDA (Latent Dirichlet Allocation) model, and then extracts data in a specific domain by combining the feature word sequences of the data with the entities and inter-entity relationships described by a knowledge graph. Experimental results on Toutiao ("Headline Today") news data and Sina Weibo data show that the proposed method extracts social media data effectively.

Key words:
- social media data
- data extraction
- latent Dirichlet allocation (LDA)
- knowledge graph

Fig. 1 Graphical model of LDA

Fig. 2 Example of a domain KG

Fig. 3 Precision of topic analysis

Fig. 4 Recall of topic analysis

Fig. 5 $F$-measure of topic analysis

Fig. 6 Precision of data extraction

Fig. 7 Recall of data extraction

Fig. 8 $F$-measure of data extraction

Fig. 9 Execution time of Algorithm 1

Fig. 10 Comparison of execution time
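Figs. 3 to 8 report precision, recall and $F$-measure. As a reminder of how these three quantities relate, here is a minimal sketch. The function name `precision_recall_f` and the choice of the balanced $F_1$ form are assumptions, since the paper's exact definition of the $F$-measure is not reproduced on this page.

```python
def precision_recall_f(extracted, relevant):
    """Return (precision, recall, F1) for a set of extracted ids
    against a set of ground-truth relevant ids."""
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall (assumed form).
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical ids: 4 items extracted, 3 truly relevant, 2 in common.
p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"})
```

With these hypothetical ids, precision is 2/4 and recall is 2/3.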

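Tab. 1 Notations

Symbol	Meaning
$I_{i}$	The $i$-th data item
$z_{k}$	The $k$-th topic
$M$	Total number of social media data items
$w_{i, j}$	The $j$-th word of the $i$-th data item
$z_{i, j}$	Topic to which word $w_{i, j}$ belongs
$\Lambda_{i}$	Topic vector of the $i$-th data item
$\lambda_{k, i}$	Probability that the words in $I_{i}$ belong to topic $z_{k}$
$\Delta_{k}$	High-frequency word vector of topic $z_{k}$
$\delta_{t, k}$	Probability of word $w_{t}$ among all words of topic $z_{k}$
$\chi_{i}$	High-frequency word vector of the topics of data item $I_{i}$
$d_{i}$	Feature word sequence of data item $I_{i}$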
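For orientation, the two probabilities in Tab. 1 are commonly estimated from topic-assignment counts in LDA. The expressions below are the standard smoothed estimators with symmetric priors $\alpha$ and $\beta$. Whether they coincide exactly with the paper's Eqs. (2) and (3) is an assumption, because those equations are not reproduced on this page. Here $n_k^{(t)}$ is the number of times word $w_t$ is assigned to topic $z_k$, $n_i^{(k)}$ is the number of words in $I_i$ assigned to $z_k$, $K$ is the number of topics, and $V$ (a symbol introduced here) is the vocabulary size.

```latex
\delta_{t,k} = \frac{n_k^{(t)} + \beta}{\sum_{t'=1}^{V} n_k^{(t')} + V\beta},
\qquad
\lambda_{k,i} = \frac{n_i^{(k)} + \alpha}{\sum_{k'=1}^{K} n_i^{(k')} + K\alpha}
```

Both estimators sum to 1 (over $t$ and over $k$ respectively), so each $\Delta_k$ and each $\Lambda_i$ is a proper probability vector before sorting.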

Tab. 2 Example of social media data

id	$\boldsymbol{T}_{i}$	$\boldsymbol{A}_{i}$
a001	"再赴西藏之——青藏高原的风光 http://t.cn/zWgYG2J"	{$A_{i, u}$="国家摄影unpcn", $A_{i, p}$="2012-9-5 17:14", $A_{i, l}$="", $A_{i, v}$="微博weibo.com", $A_{i, f}$=0, $A_{i, q}$=22, $A_{i, c}$=12, $A_{i, r}$="2016-8-24"}

Algorithm 1 Obtaining topic high-frequency words and data feature words

Input: dataset $I$, number of iterations $N_{iter}$, number of topics $K$, parameters $\alpha$, $\beta$, $\kappa$
Output: data topic vectors $\Lambda_i$, topic high-frequency word vectors $\Delta_k$, feature word sequences $d_i$

1: for $k=1$ to $K$ do
        sample parameter $\varphi_k \sim \mathrm{Dir}(\beta)$
    end for
2: $n_i^{(k)}\leftarrow 0$; $n_k^{(t)}\leftarrow 0$
    for each $I_i$ in $I$ do
        sample parameter $\theta_i \sim \mathrm{Dir}(\alpha)$
        for each $w_{i, j}$ in $I_i$ do
            sample topic $z_{i, j} \sim \mathrm{Mult}(\theta_i)$; $w_{i, j} \sim \mathrm{Mult}(\varphi_{z_{i, j}})$
            $n_i^{(k)}\leftarrow n_i^{(k)}+1$; $n_k^{(t)}\leftarrow n_k^{(t)}+1$
        end for
    end for
3: for $x=1$ to $N_{iter}$ do
        for each $I_i$ in $I$ do
            for each $w_{i, j}$ in $I_i$ do
                if $z_{i, j}=k$ then
                    $n_i^{(k)}\leftarrow n_i^{(k)}+1$
                    $n_k^{(t)}\leftarrow n_k^{(t)}+1$
                end if
            end for
            $\delta_{t, k}\leftarrow$ computed by Eq. (2)
            $\Delta_k \leftarrow$ sort $\delta_{t, k}$ in descending order to obtain the high-frequency word vector of $z_k$
            $\lambda_{k, i}\leftarrow$ computed by Eq. (3)
            $\Lambda_i\leftarrow$ sort $\lambda_{k, i}$ in descending order to obtain the topic vector of $I_i$
        end for
    end for
4: for each $I_i$ in $I$ do
        take the top-$\kappa$ topics of the topic vector $\Lambda_i$ of $I_i$, in descending order of $\lambda_{k, i}$
        $d_i\leftarrow$ (words in $T_i$) $\cap$ (words in the high-frequency word vectors $\Delta_k$ of the top-$\kappa$ topics)
    end for
    return ($\Lambda_i, \Delta_k, d_i$)
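The steps of Algorithm 1 can be approximated with a collapsed Gibbs sampler. The sketch below is not the authors' implementation. The function names `lda_gibbs` and `feature_words`, the `top_words` cutoff, the random initialization, and the smoothed count ratios standing in for Eqs. (2) and (3) are all assumptions made for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, K, alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized docs (lists of words).
    Returns (lam, delta): per-document topic vectors and per-topic word
    vectors, each sorted by descending probability."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    n_ik = [[0] * K for _ in docs]        # n_i^(k): words of doc i in topic k
    n_kt = [Counter() for _ in range(K)]  # n_k^(t): count of word t in topic k
    n_k = [0] * K                         # total words assigned to topic k
    z = []                                # z[i][j]: topic of word j in doc i
    for i, doc in enumerate(docs):        # random initial assignment (assumed)
        z.append([rng.randrange(K) for _ in doc])
        for w, k in zip(doc, z[i]):
            n_ik[i][k] += 1; n_kt[k][w] += 1; n_k[k] += 1
    for _ in range(n_iter):
        for i, doc in enumerate(docs):
            for j, w in enumerate(doc):
                k = z[i][j]               # remove the current assignment
                n_ik[i][k] -= 1; n_kt[k][w] -= 1; n_k[k] -= 1
                weights = [(n_ik[i][c] + alpha) * (n_kt[c][w] + beta)
                           / (n_k[c] + V * beta) for c in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[i][j] = k               # record the resampled topic
                n_ik[i][k] += 1; n_kt[k][w] += 1; n_k[k] += 1
    # Smoothed estimates (assumed stand-ins for Eqs. (2) and (3)).
    delta = [sorted(((w, (n_kt[k][w] + beta) / (n_k[k] + V * beta))
                     for w in vocab), key=lambda p: -p[1]) for k in range(K)]
    lam = [sorted(((k, (n_ik[i][k] + alpha) / (len(doc) + K * alpha))
                   for k in range(K)), key=lambda p: -p[1])
           for i, doc in enumerate(docs)]
    return lam, delta


def feature_words(doc, lam_i, delta, kappa=1, top_words=10):
    """Step 4: keep the words of doc that also appear among the top_words
    high-frequency words of the doc's top-kappa topics (order preserved)."""
    keep = {w for k, _ in lam_i[:kappa] for w, _ in delta[k][:top_words]}
    d_i = []
    for w in doc:
        if w in keep and w not in d_i:
            d_i.append(w)
    return d_i


# Hypothetical, pre-tokenized toy corpus.
docs = [["tibet", "travel", "hotel"], ["stock", "market", "bank"],
        ["travel", "hotel", "airport"]]
lam, delta = lda_gibbs(docs, K=2, n_iter=20, seed=1)
d_0 = feature_words(docs[0], lam[0], delta, kappa=1, top_words=3)
```

Unlike step 3 as printed, the sketch decrements the counts before resampling each $z_{i,j}$, which is the usual way to keep the counts consistent in a collapsed sampler.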

Algorithm 2 Domain-specific data extraction

Input: feature word sequences $d_{i}$, domain $G_{k}$, domain $\neg G_{k}$, parameter $\tau$
Output: dataset $D$ of the specific domain

for each $w_{i, j}$ in $d_{i}$ do
    $m_{i}\leftarrow$ length of $d_{i}$
    for $v_{x}=v_{0}$ in $G_{k}$ do
        if $w_{i, j}=v_{x}$ then
            $n\leftarrow n+1$
        else $v_{x}\leftarrow v_{x+1}$, provided that edge ($v_{x}$, $v_{x+1}$, $label$) exists in $G_{k}$
        end if
    end for
    for $v_{x}=v_{0}$ in $\neg G_{k}$ do
        if $w_{i, j}=v_{x}$ then
            $n' \leftarrow n'+1$
        else $v_{x}\leftarrow v_{x+1}$, provided that edge ($v_{x}$, $v_{x+1}$, $label$) exists in $\neg G_{k}$
        end if
    end for
    $p\leftarrow n/m_{i}$
    $p' \leftarrow n'/m_{i}$
    if $p >\tau$ and $p' <\tau$ then
        $D\leftarrow D \cup \{I_{i}\}$
    end if
end for
return $D$
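One way to read Algorithm 2 is: count how many feature words of a data item match entities in the domain KG and in the non-domain KG, then keep the item when the first ratio exceeds $\tau$ and the second stays below it. The sketch below follows that reading under several assumptions: each KG is a list of `(head, label, tail)` triples, the traversal from $v_0$ is a breadth-first search along directed edges, the counters are reset for every data item, and the names `reachable_entities` and `extract` are hypothetical.

```python
from collections import deque


def reachable_entities(triples, start):
    """Collect all entities reachable from `start` by following
    directed edges (head, label, tail)."""
    adjacency = {}
    for head, _label, tail in triples:
        adjacency.setdefault(head, []).append(tail)
    seen, queue = {start}, deque([start])
    while queue:
        v = queue.popleft()
        for nxt in adjacency.get(v, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def extract(feature_seqs, domain_kg, other_kg, domain_root, other_root, tau=0.3):
    """Return ids of data items whose feature words match the domain KG
    often enough (p > tau) and the non-domain KG rarely enough (p' < tau)."""
    in_domain = reachable_entities(domain_kg, domain_root)
    out_domain = reachable_entities(other_kg, other_root)
    D = []
    for item_id, d_i in feature_seqs.items():
        m = len(d_i)
        if m == 0:
            continue  # no feature words, so the ratios are undefined
        n = sum(w in in_domain for w in d_i)         # matches in G_k
        n_prime = sum(w in out_domain for w in d_i)  # matches in not-G_k
        if n / m > tau and n_prime / m < tau:
            D.append(item_id)
    return D


# Hypothetical toy KGs and feature word sequences.
tourism_kg = [("tourism", "includes", "hotel"),
              ("tourism", "includes", "airport"),
              ("hotel", "located_in", "city")]
finance_kg = [("finance", "includes", "stock"),
              ("finance", "includes", "bank")]
feature_seqs = {"a": ["travel", "hotel", "airport", "city"],
                "b": ["stock", "bank", "hotel"],
                "c": []}
D = extract(feature_seqs, tourism_kg, finance_kg, "tourism", "finance", tau=0.3)
```

Precomputing the reachable entity sets trades memory for speed: each feature word is then checked with a set lookup instead of a fresh graph traversal.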

Tab. 3 Example of data extraction results

id	$T_{i}$	$\chi_{i}$	$d_{i}$	$A_{i}$
ua6r54	#旅行攻略#【西藏】假期西藏游成为很多游客的旅游目的地, 去拉萨通常飞贡嘎机场, 离拉萨67 km, 不过推荐坐高铁直达市区酒店.	((酒店, 0.0091), (城市, 0.0072), (旅游, 0.0052), (旅行, 0.0046), (假期, 0.0045), (建筑, 0.0043), (攻略, 0.0038), (文化, 0.0037), (公园, 0.0032), (机场, 0.0031), (推荐, 0.0028), (km, 0.0026), (游客, 0.0025), (特色, 0.0024), (高铁, 0.0024))	<旅行, 攻略, 假期, 游客, 旅游, 机场, km, 推荐, 高铁, 酒店>	{$A_{i, u}$="穷游网", $A_{i, p}$="2016-7-5 17:44", $A_{i, l}$="", $A_{i, v}$="微博", $A_{i, f}$=9, $A_{i, q}$=7, $A_{i, c}$=20, $A_{i, r}$="2016-8-24"}
