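• China's Outstanding Sci-Tech Journal
  • CCF-recommended Class A Chinese-language journal
  • T1 high-quality sci-tech journal in the computing field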
Luo Weihua, Liu Qun, Bai Shuo. A Review of the State-of-the-Art of Research on Large-Scale Corpora Oriented Language Modeling[J]. Journal of Computer Research and Development, 2009, 46(10): 1704-1712.

A Review of the State-of-the-Art of Research on Large-Scale Corpora Oriented Language Modeling

More Information
  • Published Date: October 14, 2009
  • The N-gram language model (LM) is a key component in many areas of natural language processing, such as statistical machine translation, information retrieval, and speech recognition. Using higher-order models and more training data can significantly improve application performance. However, because system resources (e.g., memory and CPU) are limited, the cost of training and querying large-scale LMs becomes prohibitive as more and more monolingual corpora become available. Research on large-scale language modeling has therefore drawn increasing attention. The authors review the state of the art on this problem, focusing on some representative approaches, including an ad hoc method, a randomized representation model, and a distributed parallel framework. The ad hoc method is a unified approach that integrates divide-and-conquer processing of the data, compact data structures, quantization-based data compression, and memory mapping. The randomized representation of the LM is a lossy compression model based on the Bloom filter. The distributed parallel framework trains the LM with MapReduce and serves N-gram requests in batch mode through remote calls. The performance of statistical machine translation systems using each approach is reported with experiments, and finally the pros and cons of the approaches are compared.
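To make the Bloom filter idea in the abstract concrete, the following is a minimal sketch of lossy n-gram membership testing. It is not the method of the surveyed work: it only records whether an n-gram was seen, whereas a real randomized LM must also encode counts or probabilities. The class name `NgramBloomFilter`, its parameters, and the SHA-256 double-hashing scheme are all assumptions chosen for illustration.

```python
import hashlib
import math


class NgramBloomFilter:
    """Hypothetical lossy set of n-grams: no false negatives, tunable false positives."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard Bloom filter sizing: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(
            1, math.ceil(-expected_items * math.log(false_positive_rate) / math.log(2) ** 2)
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, ngram):
        # Derive k bit positions from two base hashes (double hashing); an assumed scheme.
        digest = hashlib.sha256(" ".join(ngram).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd, non-zero step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, ngram):
        # Set every bit position derived from the n-gram.
        for pos in self._positions(ngram):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, ngram):
        # True if all bits are set; this may be a false positive, never a false negative.
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(ngram))


# Toy usage with made-up n-grams.
lm = NgramBloomFilter(expected_items=1000, false_positive_rate=0.01)
for ngram in [("the", "cat"), ("cat", "sat"), ("sat", "on", "the")]:
    lm.add(ngram)

print(("the", "cat") in lm)  # True: an inserted n-gram is always reported as present
```

The space saving comes from never storing the n-gram strings themselves: with these settings the filter uses roughly 10 bits per expected n-gram, at the price of occasionally reporting an unseen n-gram as present.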
