An Approximate Matching Algorithm for Large Scale Lexicons

Gong Caichun; Huang Yulan; Xu Hongbo; Bai Shuo

Gong Caichun, Huang Yulan, Xu Hongbo, Bai Shuo. An Approximate Matching Algorithm for Large Scale LexiconsJ. Journal of Computer Research and Development, 2008, 45(10): 1776-1781.

Citation:

Gong Caichun, Huang Yulan, Xu Hongbo, Bai Shuo. An Approximate Matching Algorithm for Large Scale LexiconsJ. Journal of Computer Research and Development, 2008, 45(10): 1776-1781.

Citation:

Gong Caichun, Huang Yulan, Xu Hongbo, Bai Shuo. An Approximate Matching Algorithm for Large Scale LexiconsJ. Journal of Computer Research and Development, 2008, 45(10): 1776-1781.

An Approximate Matching Algorithm for Large Scale Lexicons

Graphical Abstract

Abstract

Abstract

Approximate lexicon matching is widely used for spelling correction of editors, query suggestion of search engines, post-processing of optical character recognizers and other applications. It is quite difficult for traditional single index schemes to obtain high recall and high performance at the same time. When the background lexicon becomes large, things go from worse to worst. A multiple-indices scheme is presented to handle this problem. The background dictionary is partitioned into several sub-dictionaries, each of which shares the same word length. Unigram, bigram, trigram and quadgram indices are constructed for each sub-dictionary if needed. As for the input pattern P, appropriate matching policies are deployed to obtain the set C of candidate matches, the edit distance between P and each element W in C is then computed, and the final approximate matches are then engendered. When a longer pattern P is queried, only indices of longer n-gram will be used to engender the candidate matches. Whats more, for each query pattern P, only a very small proportion of the input lexicon will be checked, so that this approximate matching scheme is quite efficient than traditional single index schemes. Experiments show that the number of candidate matches is much less so that the matching speed is much more promising.

FullText(HTML)

References (0)

Cited By

Turn off MathJax

Article Contents

An Approximate Matching Algorithm for Large Scale Lexicons

Abstract

Catalog

Export File

Citation

Format

Content