
    A Review of the State-of-the-Art of Research on Large-Scale Corpora Oriented Language Modeling

    Abstract: The N-gram language model (LM) is a key component in many areas of natural language processing, such as statistical machine translation, information retrieval, and speech recognition. Using higher-order models and more training data can significantly improve application performance, but under limited system resources (e.g., memory and CPU), the cost of training and querying a large-scale LM becomes prohibitive as more and more monolingual corpora become available. The training and use of higher-order LMs (e.g., N ≥ 5) over large-scale corpora has therefore become a new research focus. This paper surveys the state of the art on this problem, concentrating on several representative approaches: an integrated method that unifies divide-and-conquer processing of the data, compact data structures, quantization-based compression, and memory mapping; a randomized representation, which is a lossy LM compression model based on Bloom filters; and a distributed parallel framework that trains the LM with MapReduce and serves N-gram requests through batched remote calls. The performance of statistical machine translation systems using these approaches is reported experimentally, and their pros and cons are compared.
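
    As a concrete illustration of the Bloom-filter idea summarized above (in the spirit of Talbot and Osborne's randomized language models), the following is a minimal Python sketch of a lossy n-gram count store with logarithmically quantized counts. It is not the implementation evaluated in the surveyed work; the class name BloomLM, the hash construction, and all parameters are illustrative assumptions.

        import hashlib


        class BloomLM:
            """Sketch of a lossy, Bloom-filter-backed n-gram count store.

            An n-gram whose quantized count is c is inserted at levels 1..c,
            and a query probes levels upward until a membership test fails.
            Errors are one-sided: counts can be overestimated, never lost.
            """

            def __init__(self, num_bits, num_hashes, quant_base=2.0):
                self.bits = bytearray((num_bits + 7) // 8)
                self.num_bits = num_bits
                self.num_hashes = num_hashes
                self.quant_base = quant_base  # base of the logarithmic quantization

            def _positions(self, key):
                # Derive num_hashes bit positions from one MD5 digest (double hashing).
                digest = hashlib.md5(key.encode("utf-8")).digest()
                h1 = int.from_bytes(digest[:8], "little")
                h2 = int.from_bytes(digest[8:], "little") | 1
                return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

            def _set(self, key):
                for p in self._positions(key):
                    self.bits[p // 8] |= 1 << (p % 8)

            def _test(self, key):
                return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

            def _quantize(self, count):
                # Number of levels: the smallest q with quant_base ** q > count.
                level = 1
                while self.quant_base ** level <= count:
                    level += 1
                return level

            def add(self, ngram, count):
                for level in range(1, self._quantize(count) + 1):
                    self._set(f"{ngram}\x1f{level}")

            def count(self, ngram):
                level = 0
                # Probe levels upward; a false positive can only inflate the estimate.
                while level < 64 and self._test(f"{ngram}\x1f{level + 1}"):
                    level += 1
                if level == 0:
                    return 0
                # Invert the quantization (lower bound of the matched bucket).
                return int(self.quant_base ** (level - 1))


        lm = BloomLM(num_bits=1 << 20, num_hashes=4)
        lm.add("the quick brown", 12)
        lm.add("quick brown fox", 3)
        print(lm.count("the quick brown"))   # 8 (quantized lower bound of 12)
        print(lm.count("quick brown fox"))   # 2
        print(lm.count("never seen ngram"))  # 0 with high probability

    The one-sided error is what makes the lossy representation usable: a stored n-gram is never dropped, and a rare false positive only inflates a count estimate by one quantization bucket.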
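
    Likewise, the distributed training phase can be pictured as MapReduce-style n-gram counting. The single-process Python sketch below only imitates the map and reduce phases; a real framework shards both phases across machines and, at decoding time, would batch remote N-gram lookups as the abstract describes. All function names here are illustrative.

        from collections import defaultdict
        from itertools import chain


        def map_ngrams(sentence, n=3):
            """Map step: emit (n-gram, 1) pairs for one input sentence."""
            tokens = sentence.split()
            for i in range(len(tokens) - n + 1):
                yield " ".join(tokens[i:i + n]), 1


        def reduce_counts(pairs):
            """Reduce step: sum the partial counts for each n-gram key."""
            totals = defaultdict(int)
            for key, value in pairs:
                totals[key] += value
            return dict(totals)


        corpus = [
            "the quick brown fox jumps over the lazy dog",
            "the quick brown fox sleeps",
        ]
        counts = reduce_counts(chain.from_iterable(map_ngrams(s) for s in corpus))
        print(counts["the quick brown"])  # 2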

       
