Levenshtein算法优化及在题库判重中的应用

摘要: 为了解决Levenshtein距离算法在长文本和大规模匹配效率的不足，本文针对Levenshtein距离算法提出一种提前终止的优化策略.首先根据Levenshtein距离矩阵中元素内在的联系，归纳总结出一个递推关系式.再依据此递推关系式，提出一种提前终止策略，可提前判断两个文本是否满足预先设定的相似度阈值.经过多个学科题库判重实验的佐证，本文的提前终止策略能显著减少计算时间.
- 题库匹配 /
- 文本相似度 /
- Levenshtein编辑距离
Abstract: In order to overcome the disadvantages of the Levenshtein distance algorithm for long text and large-scale matching, we propose an early termination strategy for the Levenshtein distance algorithm. Firstly, according to the intrinsic relationship between elements in the Levenshtein distance matrix, we sum up a recurrence relation. Based on this relation, an early termination strategy is proposed to determine early-on whether two texts satisfy the predefined similarity threshold. Through several tests on different subjects, it is demonstrated that the early termination strategy can significantly reduce calculation time.
- bank match /
- text similarity /
- Levenshtein edit distance

HTML全文

图 1 改进编辑距离的重复度计算过程

Fig. 1 The flow of repeatability calculation for improved editing distance

下载: 全尺寸图片幻灯片

图 2 10万生物试题计算量和计算时间节省对比图(80%重复度阈值)

Fig. 2 Performance comparison of repeatability caculation saved for 100 000 biology questions (repeatability: 80%)

下载: 全尺寸图片幻灯片

图 3 不同生物试题数量计算时间节省对比图(80%重复度阈值)

Fig. 3 Comparison of calculation time saved for biology questions in different sizes (repeatability: 80%)

下载: 全尺寸图片幻灯片

图 4 不同学科计算时间节省对比图

Fig. 4 Comparison of calculation time saved for different disciplines

下载: 全尺寸图片幻灯片

表 1 字符串$S$与$T$的距离矩阵表

Tab. 1 Distance matrix of strings $S$ and $T$

	Null	$A$	$B$	$C$	$D$
Null	0	1	2	3	4
$E$	1	1	2	3	4
$A$	2	1	2	3	4
$B$	3	2	1	2	3
$C$	4	3	2	1	2

下载: 导出CSV

表 2 $d_{i, j}$的$LD$距离矩阵元素含义

Tab. 2 Meaning of $d_{i, j}$ in $LD$ distance matrix

	Null	$t_1$	$\cdots$	$t_j$	$\cdots$	$t_n$
Null	0	1	$\cdots$	$j$	$\cdots$	$n$
$s_1$	1
$\cdots$	$ \cdots$
$s_i$	$i$			$d_{i, j}$
$\cdots$	$\cdots$
$s_m$	$m$					$d_{m, n}$

下载: 导出CSV

表 3 递推关系式对应$LD$矩阵表

Tab. 3 Recursive relation in $LD$ distance matrix

	Null	$t_1$	$\cdots$	$t_{n-m-1}$	$\cdots$	$t_{n-1}$	$t_n$
Null	0	1	$\cdots$	$n-m-1$	$\cdots$	$n-1$	$n$
$s_1$	1			$d_{1, n-m-1}$
$\cdots$	$\cdots$				$\cdots$
$s_{m-1}$	$m-1$					$d_{m-1, n-1}$
$s_m$	$ m$						$d_{m, n}$

下载: 导出CSV

算法 1 改进Levenshtein算法

输入：字符串$S$与字符串$T$, 相似度阈值$Sim$
输出：字符串$S$与字符串$T$是否相似

1: 比较两个字符串的长度, 将较长字符串的作为行, 较短的作为列, 保证$n\geq m$
2: 初始化$LD$矩阵$LD[m+1, n+1]$, 按照式(1)中$i=0$与$j=0$初始化第一行和第一列, 并初始化变量$line=1$
3: 使用Levenshtein算法计算第$line$行数据
4: 若$LD[line, n-m-line]$大于(1-$Sim)n$, 转步骤9, 否则转步骤5
5: 若$LD[line, n-m-line]+m-line$小于等于(1-$Sim)n$, 转步骤8, 否则转步骤6
6: 若$line$小于m, 转步骤7，否则转步骤8
7: $line++$, 转步骤3
8: 返回两字符串相似
9: 返回两字符串不相似

下载: 导出CSV

表 4 字符串$S$和$T$的原始$LD$矩阵

Tab. 4 $LD$ distance matrix for $S$ and $T$

	Null	$A$	$A$	$A$	$B$
Null	0	1	2	3	4
$A$	1	0	1	2	3
$A$	2	1	0	1	2
$A$	3	2	1	0	1
$ C$	4	3	2	1	1

下载: 导出CSV

表 5 字符串$S'$和$T'$的$LD$矩阵

Tab. 5 $LD$ distance matrix for $S'$ and $T'$

	Null	$B$	$A$	$A$	$A$
Null	0	1	2	3	4
$C$	1	1	2	3	4
$A$	2	2	1	2	3
$A$	3	3	2	1	2
$A$	4	4	3	2	1

下载: 导出CSV

表 6 10万生物试题的各项数据指标表(80%重复度阈值)

Tab. 6 Calculation statistics for 100 000 biology questions (repeatablity: 80%)

长度	个数	计算量	改进后	计算量节省	改进前	改进后	计算时间节省
长度	个数	改进前	计算量	百分比/%	计算时间/s	计算时间/s	百分比/%
0—14	11 616	1.63$\times $10$^{8}$	2.89$\times $10$^{7}$	82.2	14	4	71.4
10—18	16 044	2.17$\times $10$^{8}$	3.98$\times $10$^{7}$	81.7	18	5	72.2
14—23	18 194	4.20$\times $10$^{8}$	6.84$\times $10$^{7}$	83.7	36	10	72.2
18—29	19 651	6.88$\times $10$^{8}$	1.01$\times $10$^{8}$	85.3	50	13	74.0
23—37	20 775	1.03$\times $10$^{9}$	1.52$\times $10$^{8}$	85.3	91	23	74.7
29—47	22 264	1.28$\times $10$^{9}$	1.76$\times $10$^{8}$	86.2	141	32	77.3
37—59	22 345	1.51$\times $10$^{9}$	2.06$\times $10$^{8}$	86.3	206	48	76.6
47—74	20 315	1.44$\times $10$^{9}$	1.82$\times $10$^{8}$	87.3	224	48	78.6
59—93	15 966	7.83$\times $10$^{8}$	1.04$\times $10$^{8}$	86.7	179	35	80.4
74—117	11 176	5.34$\times $10$^{8}$	7.25$\times $10$^{7}$	86.4	185	33	82.2
93—147	7 720	5.50$\times $10$^{8}$	6.99$\times $10$^{7}$	87.3	246	36	85.4
117—$\infty$	7 474	9.93$\times $10$^{8}$	1.56$\times $10$^{8}$	88.7	511	78	84.7

下载: 导出CSV

表 7 不同生物试题数量实验对比表(80%重复度阈值)

Tab. 7 Performance comparison of repeatability calculation for biology questions in different sizes (repeatability: 80%)

试题数量/万	改进前的计算时间/s	改进后的计算时间/s	计算时间节省百分比/s
10	1 913	376	80.3
20	4 816	1 033	78.7
30	7 178	1 393	80.6
40	11 099	2 051	81.5
50	19 998	3 743	81.3

下载: 导出CSV

表 8 不同学科实验对比表(80%重复度阈值)

Tab. 8 Performance comparison of repeatability calculation for multiple disciplines

学科	改进前的	改进后的	计算时间
学科	计算时间/s	计算时间/s	节约百分比/%
语文	223 942	43 626	80.5
数学	131 779	25 838	80.4
物理	50 631	10 355	79.5
化学	35 975	6 372	82.3
生物	19 998	3 743	81.3
地理	25 709	4 944	80.8
政治	44 603	9 462	78.8
历史	27 729	6 075	78.1

下载: 导出CSV

参考文献(13)

[1]	HALL P, DOWLING G. Approximate string matching[J]. ACM Computing Survey, 1980, 12(4):381-402. doi: 10.1145/356827.356830
[2]	WAGNER R, FISCHER M. The string-to-string correction problem[J]. Journal of the ACM, 1974, 21(1):168-173. doi: 10.1145/321796.321811
[3]	章成志.基于多层特征的字符串相似度计算模型[J].情报学报, 2005, 24(6):696-701. doi: 10.3969/j.issn.1000-0135.2005.06.010
[4]	MONGE A, ELKAN C. The field matching problem: Algorithms and applications[C]//Proc of ACM International Conference on Knowledge Discovery and Data Mining. New York: ACM, 1996: 267-270.
[5]	NIRENBURG S, DOMASHNEV C, GRANNES D. Two approaches to matching in example-based machine translation[C]//Proc of the 5th International Conference on Theoretical and Methodological Issues in Machine Translation. Kyoto: TMI, 1993: 47-57.
[6]	JIN L, LI C, MEHROTRA S. Efficient record linkage in large data sets[C]//Proc of the 8th International Conference on Database System for Advanced Application. Kyoto: DASFAA, 2003: 137-146.
[7]	邵清, 叶琨.基于编辑距离和相似度改进的汉字字符串匹配[J].电子科技, 2016, 29(9):7-11. doi: 10.3969/j.issn.1009-6108.2016.09.004
[8]	廖宏建, 杨玉宝, 唐连章.改进的编辑距离计算及其在自动评分中的应用[J].广州大学学报(自然科学版), 2012, 11(4):79-83. doi: 10.3969/j.issn.1671-4229.2012.04.015
[9]	LEVENSHTEIN V. Binary codes capable of correcting deletions, insertions and reversals[J]. Soviet Physics Doklady, 1966, 10(1):845-848. http://www.ams.org/mathscinet-getitem?mr=189928
[10]	玛依热·依布拉音, 米吉提·阿不里米提, 艾斯卡尔·艾木都拉.基于最小编辑距离的维语词语检错与纠错研究[J].中文信息学报, 2008, 22(3):110-114. doi: 10.3969/j.issn.1003-0077.2008.03.015
[11]	姜华, 韩安琪, 王美佳, 等.基于改进编辑距离的字符串相似度求解算法[J].计算机工程, 2014, 40(1):222-227. http://d.old.wanfangdata.com.cn/Periodical/jsjgc201401047
[12]	何锋, 谷锁林, 陈彦辉.基于编辑距离相似度的文本校验技术研究与应用[J].飞行器测控学报, 2015, 34(4):389-394. http://d.old.wanfangdata.com.cn/Periodical/fxqckxb201504014
[13]	LI G, DENG D, WANG J, et al. Pass-join:A partition-based method for similarity joins[J]. Proceedings of the VLDB Endowment, 2011, 5(3):253-264. doi: 10.14778/2078331