Data cleaning on probabilistic RDF database
-
摘要: 由于在获取、解析数据的过程中存在误差、干扰等因素,很多领域的数据中存在着不确定性,这已成为影响数据性能的重要因素.概率数据库可以存储不确定数据并且返回带有置信度的查询结果.然而,不确定性的累积和传播会降低查询结果的可用性.因此,有必要降低概率数据库中数据的不确定性.致力于解决在概率RDF(Resource Description Framework)数据库图查询中如何由众包来提升查询结果的确定性,基本思想是让众包工作者决定由边表示的关系是否正确,以降低整个查询的不确定性.提出了3种不同的算法来选择使查询结果不确定性下降最大的边.最后,通过实验验证了提出的算法,表明不稳定剪枝算法和稳定剪枝算法具有更好的效果.Abstract: Due to the factors such as errors and noises in the process of obtaining and analyzing data, uncertain data arises in many domains, which has emerged as an important issue affecting the performance of data. Uncertain data can be stored in probabilistic databases and query facilities always yield answers with confidence. However, the accumulation and propagation of uncertainty may reduce the usability of the query results. As such, it is desirable to reduce the uncertainty of uncertain data. This paper aims at solving the problem how to promote the answers' certainty in RDF(resource description framework) graph query via crowdsourcing. The basic idea is to ask the crowd to decide whether the relationships represented by some edges are correct. In this paper, we introduce three different algorithms to select the edge which maximizes the uncertainty reduction. Finally, we verify these algorithms by experiments and show that unstable pruning algorithm and stable pruning algorithm perform better in term of efficiency.
-
Key words:
- probabilistic RDF graph /
- crowdsourcing /
- data cleaning
-
图 2 概率RDF数据的图表示
Fig. 2 Graph representation of probabilistic RDF data in Fig. 1
图 8 关于图 3中边$AaC$的ECQ
Fig. 8 An ECQ about the edge $AaC$
同构图 概率 (1) $A\longrightarrow\!\!\!\!\!\!\!\! ^a C\longrightarrow\!\!\!\!\!\!\!\!^b E$ 0.56 (2) $A\longrightarrow\!\!\!\!\!\!\!\!^a D\longrightarrow\! \!\!\!\!\!\!\! ^b E$ 0.2 (3) PHI 0.24 表 2 新的查询结果
Tab. 2 The new results set
同构图 概率 (1) $A\longrightarrow\!\!\!\!\!\!\!\!^a C\longrightarrow\!\!\!\!\!\!\!\!^b E$ 0.7 (2) PHI 0.3 算法1 朴素算法 输入: $RS$结果集 输入: $G$概率RDF数据图 输出: $e_{\max}$使期望质量提升最大化的边 1.访问$RS$获得有效属性集$EA$ 2.访问$G$获得有效边集$EE$ 3. while ($e_{\rm cur}=EE.{\rm getNextEdge}())\neq {\rm null do}$ 4. while (${\rm iso}Q_{\rm cur}=RS.{\rm getNextIso}Q()) \neq {\rm null}$ do 5. if ${\rm iso}Q_{\rm cur}$包含$e_{\rm cur}$ then 6. $e_{\rm cur}\cdot p1\leftarrow e_{\rm cur}\cdot p1+{\rm iso}Q_{\rm cur}\cdot {\rm getProb}()$ 7. else if ${\rm iso}Q_{\rm cur}$包含$e'_{\rm cur}$ then 8. $e_{\rm cur}\cdot p2\leftarrow e_{\rm cur}\cdot p2+{\rm iso}Q_{\rm cur}\cdot {\rm getProb}()$ 9. else 10. $e_{\rm cur}\cdot p3\leftarrow e_{\rm cur}\cdot p3+{\rm iso}Q_{\rm cur}\cdot {\rm getProb}()$ 11. end if 12. end while 13. $e_{\rm cur}\cdot {\rm delta}\leftarrow E\Delta H'$ 14. end while 15. $e_{\max}\leftarrow {\rm null}$ 16. $\max{\rm Delta}H\leftarrow0$ 17. while ($e_{\rm cur}=EE.{\rm getNextEdge}())\neq $ do 18. if $e_{\rm cur}.{\rm getDeltaH}()>\max{\rm Delta}$ then 19: $e_{\max}\leftarrow e_{cur}$ 20: $\max{\rm Delta}H\leftarrow e_{\rm cur} \cdot {\rm getDelta}H()$ 21. end if 22. end while 23. return $e_{\rm max}$ 算法2 不稳定剪枝算法 输入: $RS$结果集 输入: $G$概率RDF数据图 输出: $e_{\max}$使期望质量提升最大化的边 1.访问$RS$获得有效属性集$EA$ 2.访问$G$获得有效边集$EE$ 3.将$EE$按照$H(e)$的大小降序排序 4. $\max{\rm Delta}H\leftarrow 0$ 5. $e_{\max}\leftarrow {\rm null}$ 6. while $e_{\rm cur}=EE.{\rm getNextEdge}())\neq {\rm null}$ do 7. while (${\rm iso}Q_{\rm cur}=RS.{\rm getNextIso}Q()) \neq {\rm null}$ do 8. if ${\rm iso}Q_{\rm cur}$包含$e_{\rm cur}$ then 9. $e_{\rm cur}\cdot p1\leftarrow e_{\rm cur}\cdot p1+{\rm iso}Q_{\rm cur}\cdot {\rm getProb}()$ 10. else if ${\rm iso}Q_{\rm cur}$包含$e'_{\rm cur}$ then 11: $e_{\rm cur}\cdot p2\leftarrow e_{\rm cur}\cdot p2+{\rm iso}Q_{\rm cur}\cdot {\rm getProb}()$ 12. else 13. $e_{\rm cur}\cdot p3\leftarrow e_{\rm cur}\cdot p3+{\rm iso}Q_{\rm cur}\cdot {\rm getProb}()$ 14. end if 15. end while 16. if $H(e_{\rm cur})-EH(RS|E_{ijk})'>{\rm maxDelta}H$ then 17. ${\rm maxDelta}H\leftarrow H(e_{\rm cur}) -EH(RS|E_{ijk})'$ 18. $e_{\rm max}\leftarrow e_{\rm cur}$ 19. end if 20. if $H(e_{\rm cur}) < \max{\rm Delta}H$ then 21. return $e_{\max}$ 22. end if 23. end while 算法3 稳定的剪枝算法 输入: $RS$结果集 输入: $G$概率RDF数据图 输出: $e_{\max}$使期望质量提升最大化的边 1.初始化二维表$EET$为空 2. $Pr({\rm PHI})\leftarrow1$ 3. while (${\rm iso}Q_{\rm cur}=RS.{\rm getNestIso}Q()) \neq {\rm null}$ do 4. $Pr({\rm PHI})\leftarrow -Pr({\rm iso}Q_{\rm cur})$ 5. while (${\rm edg}e_{\rm cur}\in {\rm iso}Q_{\rm cur}) \neq {\rm null}$ do 6. if ${\rm edg}e_{\rm cur}\notin EE $ then 7.在$EET$中寻找其所属的行, 若有则新增${\rm edg}e_{\rm cur}$, 其$Pr(r_1)$初始化为$Pr({\rm iso}Q_{\rm cur});$ 若无则新增属性和${\rm edg}e_{\rm cur}$, 它们的$Pr(r_1)$都初始化为$Pr({\rm iso}Q_{\rm cur})$ 8: $EE$中新增${\rm edg}e_{\rm cur}$并令其指向表中所属位置 9. else 10.其对应边和属性的$Pr(r_1)$加上$Pr({\rm iso}Q_{\rm cur})$ 11. end if 12. end while 13. end while 14.访问$G$将没有出现在同构图中的有效边添加到$EE$中, 并令其指向到$EET$中相应的属性上 15.对$EE$按照$H(e)$的大小降序排序 16. while ($e_{\rm cur}=EE.{\rm getNextEdge}())\neq {\rm null}$ do 17. if $H(e_{\rm cur})-EH(RS|E_{ijk})'>{\rm maxDelta}H$ then 18. $\max{\rm Delta}H\leftarrow H(e_{\rm cur}) -EH(RS|E_{ijk})'$ 19. $e_{\max}\leftarrow e_{\rm cur}$ 20. end if 21. if $H(e_{\rm cur}) < \max{\rm Delta}H$ then 22. return $e_{\max}$ 23. end if 24. end while -
[1] DALVI N, SUCIU D. Efficient query evaluation on probabilistic databases[J]. The VLDB Journal-The International Journal on Very Large Data Bases, 2007, 16(4):523-544. doi: 10.1007/s00778-006-0004-3 [2] ABITEBOUL S, SENELLART P. Querying and updating probabilistic information in XML[C]//Advances in Database Technology-EDBT 2006, International Conference on Extending Database Technology. DBLP, 2006:1059-1068. [3] DONG X, GABRILOVICH E, HEITZ G, et al. Knowledge vault:A web-scale approach to probabilistic knowledge fusion[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014:601-610. [4] ANGLES R, GUTIERREZ C. Querying RDF data from a graph database perspective[C]//European Semantic Web Conference. Berlin:Springer, 2005:346-360. [5] FANG H, ZHANG X W. pSPARQL:A querying language for probabilistic RDF[C/OL]//Proceedings of the ISWC 2016 Posters & Demonstrations Track Co-Located with 15th International Semantic Web Conference, ISWC 2016.[2016-08-01]. http://ceur-ws.org/Vol-1690/paper18.pdf. [6] FUKUSHIGE Y. Representing probabilistic relations in RDF[C/OL]//Proceedings of the International Semantic Web Conference, ISWC 2005.[2016-08-01]. http://ceur-ws.org/Vol-173/pospaper5.pdf. [7] LIAN X, CHEN L. Efficient query answering in probabilistic rdf graphs[C]//Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, 2011:157-168. [8] HUANG H, LIU C F. Query evaluation on probabilistic RDF databases[C]//International Conference on Web Information Systems Engineering. Berlin:Springer, 2009:307-320. [9] UDREA O, SUBRAHMANIAN V S, MAJKIC Z. Probabilistic RDF[C]//Information Reuse and Integration, 2006 IEEE International Conference on. IEEE, 2006:172-177. [10] ZHANG C J, CHEN L, JAGADISH H V, et al. Reducing uncertainty of schema matching via crowdsourcing[J]. Proceedings of the VLDB Endowment, 2013, 6(9):757-768. doi: 10.14778/2536360 [11] CHENG R, CHEN J C, XIE X K. Cleaning uncertain data with quality guarantees[J]. Proceedings of the VLDB Endowment, 2008(1):722-735. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.6845 [12] LEE J, LEE D, HWANG S. CrowdK:Answering top-k queries with crowdsourcing[J]. Information Sciences, 2017, 399:98-120. doi: 10.1016/j.ins.2017.03.010 [13] CICERI E, FRATERNALI P, MARTINENGHI D, et al. Crowdsourcing for top-k query processing over uncertain data[J]. IEEE Transactions on Knowledge and Data Engineering, 2016, 28(1):41-53. doi: 10.1109/TKDE.2015.2462357 [14] VERROIOS V, GARCIA-MOLINA H. Entity resolution with crowd errors[C]//2015 IEEE 31st International Conference on Data Engineering. IEEE, 2015:219-230. [15] ZHANG C J, CHEN L, TONG Y, et al. Cleaning uncertain data with a noisy crowd[C]//2015 IEEE 31st International Conference on Data Engineering. IEEE, 2015:6-17. [16] MARCUS A, WU E, KARGER D, et al. Human-powered sorts and joins[J]. Proceedings of the VLDB Endowment, 2011, 5(1):13-24. doi: 10.14778/2047485 [17] MARCUS A, WU E, KARGER D R, et al. Crowdsourced databases:Query processing with people[C/OL]//Proceedings of the 5th Biennial Conference on Innovative Data Systems Research, CIDR 2011.[1016-08-01]. http://www-db.cs.wisc.edu/cidr/cidr2011/Papers/CIDR11Paper29.pdf. [18] SUCHANEK F M, KASNECI G, WEIKUM G. Yago:a core of semantic knowledge[C]//Proceedings of the 16th International Conference on World Wide Web. ACM, 2007:697-706.