Abstract:
The N-gram language model (LM) is a key component in many areas of natural language processing, such as statistical machine translation, information retrieval, and speech recognition. Using higher-order models and more training data can significantly improve application performance. However, because system resources (e.g., memory and CPU) are limited, the cost of training and querying large-scale LMs becomes prohibitive as ever larger monolingual corpora become available. Research on large-scale language modeling has therefore attracted increasing attention. The authors survey the state of the art on this issue, focusing on several representative approaches: an ad hoc method, a randomized representation model, and a distributed parallel framework. The ad hoc method is a unified approach that integrates data partitioning (divide and conquer), compact data structures, quantization-based compression, and memory mapping. The randomized representation is a lossy LM compression model based on Bloom filters. The distributed parallel framework trains the LM with MapReduce and serves N-gram requests in batched remote calls. The performance of statistical machine translation systems using each approach is reported experimentally, and the advantages and drawbacks of the approaches are compared.
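To make the Bloom-filter idea mentioned above concrete, the following minimal sketch illustrates how n-grams can be stored in a bit array and queried for membership with a small false-positive rate (but no false negatives). The class name, array size, and hashing scheme are illustrative assumptions for exposition only, not the implementation used by the surveyed systems.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter for set-membership queries over n-grams."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions via double hashing of an MD5 digest.
        digest = hashlib.md5(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))


if __name__ == "__main__":
    # Store trigrams; queries may yield false positives, never false negatives.
    bf = BloomFilter(num_bits=1 << 20, num_hashes=4)
    bf.add("the cat sat")
    bf.add("cat sat on")
    print("the cat sat" in bf)   # True
    print("sat on the" in bf)    # False with high probability
```

The trade-off this sketch illustrates is the one the lossy model exploits: memory is reduced to a fixed bit array regardless of vocabulary size, at the cost of a tunable probability of spurious hits.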