基于多重索引模型的大规模词典近似匹配算法

龚才春; 黄玉兰; 许洪波; 白  硕

基于多重索引模型的大规模词典近似匹配算法

An Approximate Matching Algorithm for Large Scale Lexicons

摘要

摘要: 编辑器的拼写校正、搜索引擎的查询纠正、光学字符识别的结果检查等领域都用到词典近似匹配算法.传统单索引模式很难在高性能的前提下保证高召回率.词典越大问题越严重.提出了大规模词典近似匹配的多重索引模型，首先将背景词典根据单词长度划分为若干子词典，对各子词典按照一定策略建立unigram,bigram,trigram,quadgram中的一种或若干种索引，当查找用户模式P的近似匹配时，根据模式P检索特定N-gram索引链，从而得到候选近似匹配集合C，对C中每一个单词W，计算P与W的编辑距离即可输出P的所有最终匹配结果R.实验表明，基于多重索引模型的词典近似匹配算法能够大幅度减少候选近似匹配结果的数量，从而提高词典近似匹配的速度.

Abstract: Approximate lexicon matching is widely used for spelling correction of editors, query suggestion of search engines, post-processing of optical character recognizers and other applications. It is quite difficult for traditional single index schemes to obtain high recall and high performance at the same time. When the background lexicon becomes large, things go from worse to worst. A multiple-indices scheme is presented to handle this problem. The background dictionary is partitioned into several sub-dictionaries, each of which shares the same word length. Unigram, bigram, trigram and quadgram indices are constructed for each sub-dictionary if needed. As for the input pattern P, appropriate matching policies are deployed to obtain the set C of candidate matches, the edit distance between P and each element W in C is then computed, and the final approximate matches are then engendered. When a longer pattern P is queried, only indices of longer n-gram will be used to engender the candidate matches. Whats more, for each query pattern P, only a very small proportion of the input lexicon will be checked, so that this approximate matching scheme is quite efficient than traditional single index schemes. Experiments show that the number of candidate matches is much less so that the matching speed is much more promising.

HTML全文

参考文献(0)

施引文献

资源附件(0)