Design and implementation of Smart materialization for column-store in CLAIMS
-
摘要: 物化是列存储数据库查询中必不可少的操作,物化策略和物化技术在查询执行过程中起着至关重要的作用.因此设计一种针对列存储数据库的物化策略尤为重要.提前物化生成的元组中存在无关属性;而延迟物化对选择率较高的查询可能无法优化其性能,且某些列会被访问多次.针对以上缺点,本文提出了有别于上述两种策略的策略——Smart物化策略.本文提出了在逻辑查询计划中使用结构——projection,该结构是由用户选取查询所需的属性来生成的,相当于对全表进行物理上的切分;在查询开始时,能减少直接加载到内存的数据量,避免额外的开销.在构建逻辑查询计划过程中,Smart物化策略将projection作为扫描操作标准来对数据进行按列划分,根据一组语句集中对列访问的相关性来对下一次查询所需要的列进行预测,将所需要的列加入到一个最合适的projection中来进行物化.本文通过在分布式内存数据库CLAIMS上使用TPC-H数据集来验证其有效性.
-
关键词:
- projection /
- Smart物化 /
- 数据压缩
Abstract: Materialization is a necessary operation in the process of query execution. Materialization strategy and materialization technology play an important role in the process of query execution. Therefore, it is necessary to design a materialization strategy for column-store database. According to the shortcomings of early materialization and later materialization, we provide a strategy named Smart materialization that are different from the two strategies mentioned above. Here we need to define a concept in the logical query plan-projection, the structure is used to select the desired attributes, the physical table is cut by column, to ensure that the structure at the beginning of the query can reduce the direct load to memory of the amount of data, to avoid additional overhead. In the logical query plan, the projection is divided by columns, and the next required columns are predicted according to the relevance of the query in a set of queries, and the required columns are stabilized in one of the most appropriate projection. We use the data set of TPC-H to verify its validity worked on the disturbed in-memory database-CLAIMS.-
Key words:
- projection /
- Smart materialization /
- data compression
-
算法1 延迟物化 输入: $C$ 为属性集; AGG为聚合操作集; $T$ 为表集; $F$ 为条件集; $S$ 为排序操作集
select $\{C\}$ $\{AGG\}$ from $\{T\}$ where $\{F\}$ $\{S\}$
输出:元组集 $\{R\}$
1 for根据集合 $\{F\}$ 中的条件生成符合条件的位向量
2 end for
3 初始化目标位向量 $W$
4 for each $i$ do
5 $W=W$ $op_{I}$ $w_i$
6 end for
7 初始化目标物化结果空间 $O;/\!/O$ 有 $(N+Q)$ 个存储空间
8 for each $i$ do
9 if ( $i<N$ ) then $O_i$ (集合 $C$ 对应的列) AND $W$
10 else $O_i$ (集合AGG中的列) AND $W$
11 end if
12 end for
13 初始化最终查询结果空间 $Sp;/\!/Sp$ 有 $(N+Q)$ 个存储空间
14 for each $i$ do
15 if ( $i\leqslant N$ ) then $Sp_i=O_i$
16 else $Sp_i=$ {AGG} $(O_i)$
17 end if
18 end for算法2 动态生成projection 输入:最近查询语句集 $Q$ , 表 $R$
输出: projection
1 对于集合 $Q$ 中每一条语句, 执行如下流程
2 for选取 $Q$ 集中的一条 $q$ do
3 for选取与 $R$ 表中 $r_i$ do
4 if $r_i$ 与 $q$ 相关性大then
5 更新 $r_i\leftarrow r_i+w\times (q-r_i)$
6 end if
7 end for
8 end for
9 使用贪心算法, 将语句中最为常用列记录, 以此来生成新的projection
10 生成projection结束表 1 在不同SQL语句中3种物化策略的时延
Tab. 1 Simulation results of parametric estimation
SQL语句 Smart物化产生的时延/s 提前物化的时延/s 延迟物化的时延/s TPC-H 1 2.9 6.3 2.9 TPC-H 3 3.7 15 4.2 select sum(L_REVENUE) from LINEITEM sort by L_EXTENDEDPRICE 0.23 3.2 0.44 select row_id from PART where row_id>200 0 5.8 2.3 -
[1] STONEBRAKER M, ABADI D J, BATKIN A, et al. C-store:A column-oriented DBMS[C]//International Conference on Very Large Data Bases. DBLP, 2005:553-564. [2] COPELAND G P, KHOSHAFIAN S N. A decomposition storage model[C]//ACM SIGMOD International Conference on Management of Data. ACM, 1985:268-279. [3] STONEBRAKER M, ETINTEMEL U. "One Size Fits All":An idea whose time has come and gone[C]//International Conference on Data Engineering. IEEE, 2005:2-11. [4] CORMACK G V. Data compression on a database system[J]. Communications of the ACM, 1985, 28(12):1336-1342. doi: 10.1145/214956.214963 [5] 黄鹏, 李占山, 张永刚, 等.基于列存储数据库的压缩态数据访问算法[J].吉林大学学报(理学版), 2009, 47(5):1013-1019. http://www.cnki.com.cn/Article/CJFDTOTAL-JLDX200905032.htm [6] Google Snappy[EB/OL].[2017-04-01]. https://github.com/google/snappy. [7] WANG L, ZHOU M, ZHANG Z, et al. Elastic pipelining in an in-memory database cluster[C]//ACM SIGMOD. ACM, 2016:1279-1294. [8] SIKKA V, RBER F, GOEL A, et al. SAP HANA:The evolution from a modern main-memory data platform to an enterprise application platform[J]. Proceedings of the Vldb Endowment, 2013, 6(11):1184-1185. doi: 10.14778/2536222 [9] ABADI D, MADDEN S, FERREIRA M. Integrating compression and execution in column-oriented database systems[C]//ACM SIGMOD International Conference on Management of Data. DBLP, 2006:671-682. [10] ABADI D J, MYERS D S, DEWITT D J, et al. Materialization strategies in a column-oriented DBMS[C]//2007 IEEE 23rd International Conference on Data Engineering. IEEE, 2007:466-475. [11] CORNELL D W, YU P S. An effective approach to vertical partitioning for physical design of relational databases[J]. IEEE Transactions on Software Engineering, 1990, 16(2):248-258. doi: 10.1109/32.44388 [12] BRYANT R E, HALLARON D R O'. 深入理解计算机系统[M]. 龚奕利, 雷迎春, 译. 北京: 机械工业出版社, 2011. [13] IDREOS S, KERSTEN M L, MANEGOLD S. Self-organizing tuple reconstruction in column-stores[C]//ACM SIGMOD International Conference on Management of Data. ACM, 2009:297-308. [14] 杨传辉.大规模分布式存储系统[M].北京:机械工业出版社, 2013. [15] BRUNO N, CHAUDHURI S. To tune or not to tune?:A lightweight physical design alerter[C]//International Conference on Very Large Data Bases. VLDB Endowment, 2006:499-510. [16] TPC Benchmark H[EB/OL].[2017-04-01]. http://www.tpc.org/tpch/.