New Word Detection Based on Large-Scale Corpus

Abstract: New word detection is a part of unknown word detection, and the continuing development of natural language makes it necessary to track new words quickly. This paper presents an approach to new word detection based on a large-scale corpus. A large raw corpus collected from the Internet is first segmented with the Chinese lexical analyzer ICTCLAS. Frequency statistics over the segmented text then identify repeated strings, which yield a large number of candidate new words. To separate true new words from garbage strings, the method exploits the common patterns of two-unit, three-unit, and four-unit new words: three garbage lexicons and a suffix lexicon, all learned automatically by the system, are used to filter the candidates. Part-of-speech filtering rules and independent word probability are then applied as a final filter. On this basis, a system for online new word detection over the Internet has been implemented, and its performance in experiments is satisfactory. The system can be applied to new word detection, terminology database construction, statistics on trending named entities, and dictionary compilation.
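
The candidate generation and lexicon filtering steps summarized above can be illustrated with a minimal Python sketch. It is not the system's actual implementation. It assumes the corpus has already been segmented into token lists (ICTCLAS itself is not called), and the names `candidate_ngrams`, `filter_candidates`, `garbage_heads`, `garbage_tails`, and `min_freq` are hypothetical. The two small garbage sets only stand in for the learned garbage lexicons, whose contents the abstract does not specify, and the part-of-speech and independent word probability filters are omitted.

```python
from collections import Counter


def candidate_ngrams(segmented_sentences, min_n=2, max_n=4):
    """Count every run of min_n..max_n adjacent tokens in segmented text."""
    counts = Counter()
    for tokens in segmented_sentences:
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def filter_candidates(counts, known_words, garbage_heads, garbage_tails,
                      min_freq=2):
    """Keep repeated strings that pass the (hypothetical) lexicon filters."""
    kept = {}
    for gram, freq in counts.items():
        word = "".join(gram)
        if freq < min_freq:
            continue  # not a repeated string
        if word in known_words:
            continue  # already in the dictionary, so not a new word
        if gram[0] in garbage_heads or gram[-1] in garbage_tails:
            continue  # begins or ends with a token unlikely to be part of a word
        kept[word] = freq
    return kept


# Toy input: three sentences, already segmented (assumed segmenter output).
sentences = [
    ["我", "的", "微", "博"],
    ["他", "的", "微", "博"],
    ["微", "博", "很", "火"],
]

counts = candidate_ngrams(sentences)
result = filter_candidates(
    counts,
    known_words=set(),
    garbage_heads={"的"},   # hypothetical garbage-lexicon entry
    garbage_tails={"的"},   # hypothetical garbage-lexicon entry
)
print(result)
```

In this toy input, the repeated strings "的微" and "的微博" are discarded because they begin with a garbage token, and only "微博" survives as a candidate. A real system would learn the lexicons from data rather than hard-code them.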