Two new Top-k queries in uncertain database
-
摘要: 由于当前已有的在非确定数据库上的Top-k查询普遍基于元组层面,使得应用受限, 为了让查询结果更符合直觉,提出了两种新的非确定数据库上的基于x-元组层面的Top-k查询及其执行算法.这两种新的查询综合x元组中各元组的评分和置信度,获得在返回结果中最具实际意义的位置. 查询的执行算法经过优化,执行效率明显改善.Abstract: Since currently, the pre-existing Top-k queries in uncertain databases almost bases on tuple level rather than xtuple level, which restricts its application, for intuitive query result, the paper proposes two new local-instance level top-k queries and executive algorithm in uncertain databases. These two new queries take both rank and confidence of each x-tuple's tuple, figuring out the most meaningful position in the returned results. After the optimization of the executive algorithm, its executive efficiency has been improved manifestly.
-
Key words:
- uncertain database /
- Top-k query
-
表 1 数据库udb1
Tab. 1 The database udb1
x-tuple tuple score confidence T1 t1 58 0.4 t2 38 0.1 t3 22 0.5 T2 t1 47 0.4 t2 42 0.2 t3 33 0.4 T3 t1 53 0.7 t2 26 0.1 t3 15 0.2 表 2 Global Top-k和U-k Ranks的计算示例(概率)
Tab. 2 The example of computation of Global Top-k and U-kRanks (probability)
T1.t1 T2.t1 T3.t1 T2.t2 T1.t2 T2.t3 … Rank-1 0.4 0.072 0.42 0.036 0.012 0.06 … Rank-2 0 0.216 0.28 0.108 0.046 0.2 … Top-2 0.4 0.288 0.7 0.144 0.058 0.26 … 表 3 温度传感器非确定性数据库示例
Tab. 3 The example of uncertain database under temperature sensor
传感器 时间 地点 温度/℃ 置信度 C1 10: 00 L1001 58 0.4 C2 10: 00 L1001 38 0.1 C3 10: 00 L1001 22 0.5 C4 10: 00 L1002 47 0.4 C5 10: 00 L1002 42 0.2 C6 10: 00 L1002 33 0.4 C7 10: 00 L1003 53 0.7 C8 10: 00 L1003 26 0.1 C9 10: 00 L1003 15 0.2 表 4 预处理后的udb1
Tab. 4 The udb1 after preprocessing
x-tuple tuple score confidence T1 t1 95 0.4 T3 t1 92 0.7 T2 t1 89 0.4 T2 t2 85 0.2 T1 t2 81 0.1 T2 t3 80 0.4 T3 t2 75 0.1 T1 t3 73 0.5 T3 t3 70 0.2 表 5 OS-State [T3,T1]的分组在查询执行程序 访问到T1.t2时的情况
Tab. 5 The situation of the groups of OS-State [T3,T1] when query processing procedure accessing T1.t2
组 计算表达式 值 Sj[0] T3.t1.prob×(1−P(T2.xid))×T1.t2.prob 0.7×(1−0.4−0.2)×0.1 Sj[1] T3.t1.prob×(1−P(T1.xid))×(1−P(T2.xid)) 0.7×(1−0.4−0.1)×(1−0.4−0.2)0.7×(1−0.4−0.1)×(1−0.4−0.2) Sj[2] (1−P(T3.xid))×(1−P(T1.xid))×(1−P(T2.xid)) (1−0.7)×(1−0.4−0.1)×(1−0.4−0.2) Sj.ub ∑ 0.228 表 6 OS-State 置信度上界、下界维护方法
Tab. 6 The method of maintaining OS-State's upper bound and lower bound of confidence
匹配Sj[i](1≤i<k) 匹配Sj[k]Sj[k] 不匹配 Sj[0] 保持 DFS 保持 Sj[1]…Sj[k−i−1] 保持 ×fu(t.xid) ×fu(t.xid) Sj[k−i]Sj[k−i] DFS ×fu(t.xid) ×fu(t.xid) Sj[k−i]…Sj[k] ×fu(t.xid) ×fu(t.xid) ×fu(t.xid) 表 7 OI-State置信度上界、下界维护方法(t表示最新访问的元组)
Tab. 7 The method of maintaining OI-State's upper bound and lower bound of confidence (t means the accessing tuple)
匹配$S_j[i](1\leqslant i\leqslant k)$ 不匹配 lb DFS 保持 ub 保持 (ub--lb)$\times f_u(t.$xid)+lb Procedure 1 OS-UTop-k!OI-UTop-k Require: Pre-sorted Uncertain Database udb, k 1: N←0, P[·]0.0 2: SL←empty state list 3: while source is not exhausted do 4: t←the next tuple of udb 5: P[t.xid]←P[t.xid] + t.prob 6: if P[t.xid] = 1.0 then 7: N←N + 1 8: end if 9: if N = k then 10: quit the while loop 11: end if 12: Update states in SL by guidance as TABLE 6 or TABLE 7 13: Prune the states with upper bound less than lower bound of another 14: Generate States by t such as Procedure 2 and push into SL 15: end while 16: get the state with highest confidence as result from the SL Procedure 2 generate OS-States by tuple t Require: p is the position of t in Pre-sorted Uncertain Database udb 1: used←boolean array of false 2: tmp←array storing xid 3: used[t.xid]←TRUE 4: DFS(0,0,p) procedure DFS(i, j, p) 1: if j>p then 2: return 3: end if 4: if i = k − 1 then 5: generate OS-States Sv by xids in tmp and push into SL 6: calculate the (k + 1) groups of Sv as TABLE 5 7: end if 8: if used [udb[j].xid]=TRUE then 9: tmp[i]←udb[j].xid 10: used udb[j].xid←TRUE 11: DFS(i + k, j + 1) 12: used udb[j].xid −FALSE 13: end if 14: DFS(i, j + 1) Procedure 3 generate OI-States by tuple t Require: p is the position of t in Pre-sorted Uncertain Database D 1: if P[t.xid]>0.0 then 2: quit the procedure 3: end if 4: tmp←array storing xid 5: S←the list of seen objects' xid except t.xid 6: DFS(0,0, p) procedure DFS(i, j, p) 1: if i >p then 2: return 3: end if 4: if j = k − 1 then 5: generate OI-States Sv by xids in tmp and push into SL 6: return 7: end if 8: tmp[j]←S[i] 9: DFS(i + 1, j + 1, p) 10:DFS(i + 1, j, p) -
[1] Data clustering: algorithms and applications[M]. CRC Press, 2013. [2] ZÖFLE A, EMRICH T, SCHMID K A, et al. Representative clustering of uncertain data [C]//Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014: 243-252. [3] LI L, ZHANG X, YU Z, et al. USOM: Mining and visualizing uncertain data based on self-organizing maps [C]//Machine Learning and Cybernetics (ICMLC), 2011 International Conference on. IEEE, 2011, 2: 804-809. [4] CONINX A, BONNEAU G P, DROULEZ J, et al. Visualization of uncertain scalar data fields using color scales and perceptually adapted noise [C]//Proceedings of the ACM SIGGRAPH Symposium on Applied Perception in Graphics and Visualization. ACM, 2011: 59-66. [5] AGGARWAL C C. Trio A System for Data Uncertainty and Lineage [M]//Managing and Mining Uncertain Data. Springer US, 2009: 1-35. [6] FUXMAN, A, FAZLI, E, AND MILLER, R. J. 2005. ConQuer: Efficient management of inconsistentdatabases. In Proceedings of the ACM SIGMOD International Conference on Management ofData (SIGMOD'05). ACM, New York, 155--166. [7] SINGH S, MAYFIELD C, MITTAL S, et al. Orion 2.0: native support for uncertain data [C]//Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008: 1239-1242. [8] BENJELLOUN O, SARMA A D, HAYWORTH C, et al. An introduction to ULDBs and the Trio system [J]. IEEE Data Engineering Bulletin, March 2006, 2006. [9] ABITEBOUL S, KANELLAKIS P, GRAHNE G. On the representation and querying of sets of possible worlds [M]. ACM, 1987. [10] BENJELLOUN O, SARMA A D, HALEVY A, et al. Databases with uncertainty and lineage [J]. The VLDB Journal, 2008, 17(2): 243-264. [11] SISTLA A P, WOLFSON O, CHAMBERLAIN S, et al. Querying the uncertain position of moving objects [M]//Temporal databases: research and practice. Springer Berlin Heidelberg, 1998: 310-337. [12] CHENG R, KALASHNIKOV D V, PRABHAKAR S. Querying imprecise data in moving object environments [J]. Knowledge and Data Engineering, IEEE Transactions on, 2004, 16(9): 1112-1127. [13] DE ALMEIDA V T, G\"{U]TING R H. Supporting uncertainty in moving objects in network databases [C]//Proceedings of the 13th annual ACM international workshop on Geographic information systems. ACM, 2005: 31-40. [14] SILBERSTEIN A S, BRAYNARD R, ELLIS C, et al. A sampling-based approach to optimizing top-k queries in sensor networks [C]//Data Engineering, 2006. ICDE'06. Proceedings of the 22nd International Conference on. IEEE, 2006: 68-68. [15] CONSIDINE J, LI F, KOLLIOS G, et al. Approximate aggregation techniques for sensor databases [C]//Data Engineering, 2004. Proceedings. 20th International Conference on. IEEE, 2004: 449-460. [16] JAYRAM T S, KRISHNAMURTHY R, RAGHAVAN S, et al. Avatar Information Extraction System [J]. IEEE Data Eng. Bull., 2006, 29(1): 40-48. [17] GUPTA R, SARAWAGI S. Creating probabilistic databases from information extraction models [C]//Proceedings of the international conference on Very large data bases. 2006, 32(2): 965. [18] ILYAS I F, BESKALES G, SOLIMAN M A. A survey of Top-$k$ query processing techniques in relational database systems [J]. ACM Computing Surveys (CSUR), 2008, 40(4): 11-11. [19] GETOOR L, DIEHL C P. Link mining: a survey [J]. ACM SIGKDD Explorations Newsletter, 2005, 7(2): 3-12. [20] HRISTIDIS V, KOUDAS N, PAPAKONSTANTINOU Y. PREFER: A system for the efficient execution of multi-parametric ranked queries [C]//ACM SIGMOD Record. ACM, 2001, 30(2): 259-270. [21] NATSEV A, CHANG Y C, SMITH J R, et al. Supporting incremental join queries on ranked inputs [C]//VLDB. 2001, 1: 281-290. [22] ILYAS I F, SHAH R, AREF W G, et al. Rank-aware query optimization [C]//Proceedings of the 2004 ACM SIGMOD international conference on Management of data. ACM, 2004: 203-214. [23] LI C, CHANG K C C, ILYAS I F, et al. RankSQL: query algebra and optimization for relational top-k queries [C]//Proceedings of the 2005 ACM SIGMOD international conference on Management of data. ACM, 2005: 131-142. [24] HUA M, PEI J, ZHANG W, et al. Ranking queries on uncertain data: a probabilistic threshold approach[C]//Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008: 673-686. [25] ZHANG X, CHOMICKI J. Semantics and evaluation of Top-$k $ queries in probabilistic databases[J]. Distributed and parallel databases, 2009, 26(1): 67-126.