Abstract:
Approximate lexicon matching is widely used for spelling correction of editors, query suggestion of search engines, post-processing of optical character recognizers and other applications. It is quite difficult for traditional single index schemes to obtain high recall and high performance at the same time. When the background lexicon becomes large, things go from worse to worst. A multiple-indices scheme is presented to handle this problem. The background dictionary is partitioned into several sub-dictionaries, each of which shares the same word length. Unigram, bigram, trigram and quadgram indices are constructed for each sub-dictionary if needed. As for the input pattern P, appropriate matching policies are deployed to obtain the set C of candidate matches, the edit distance between P and each element W in C is then computed, and the final approximate matches are then engendered. When a longer pattern P is queried, only indices of longer n-gram will be used to engender the candidate matches. Whats more, for each query pattern P, only a very small proportion of the input lexicon will be checked, so that this approximate matching scheme is quite efficient than traditional single index schemes. Experiments show that the number of candidate matches is much less so that the matching speed is much more promising.