ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2015, Vol. 52 ›› Issue (3): 553-560. doi: 10.7544/issn1000-1239.2015.20131546


TML: A General High-Performance Text Mining Language

Li Jiajing1,2, Li Xiaoming3, Meng Tao2   

  1(School of Mechanical Electronic and Information Engineering, China University of Mining & Technology (Beijing), Beijing 100083); 2(Nanjing Wangganzhicha Information Technology Ltd, Nanjing 210014); 3(School of Electronics Engineering and Computer Science, Peking University, Beijing 100871)
  • Online: 2015-03-01

Abstract: This paper proposes TML, a general-purpose programming language for text mining. TML stands for “text mining language”, and it aims to make complicated text mining tasks easy. The implementation of TML includes a compiler, a runtime virtual machine (interpreter), and an IDE. TML provides most common text mining techniques, which are implemented as grammar constructs and reserved words. Users write programs in TML, and the code is compiled into bytecode, which is then interpreted by the runtime virtual machine. TML has the following characteristics: 1) It supplies a formal way to model the search scope, object definitions, and mining methods of a text mining job, so users can easily write declarative text mining programs; 2) The TML runtime virtual machine implements common text mining techniques and organizes them into an efficient text analysis pipeline; 3) The TML compiler exploits opportunities to execute its bytecode concurrently, and execution performs well on very large collections of documents and user-written rules. TML has been used in several large-scale online data analysis applications, including commodity purchase intention analysis, fine-grained reputation analysis of brands and products, and legal document analysis.
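
The abstract gives no TML syntax or bytecode format, so the following Python sketch is only a rough illustration of the general flow it describes: declarative rules are compiled to bytecode, a small interpreter executes that bytecode, and documents are processed concurrently. The rule format, the opcode names (`BEGIN`, `MATCH`, `EMIT`), and the functions `compile_rules`, `run`, and `mine` are all hypothetical and are not taken from TML.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rule format (NOT TML syntax): each rule names an object to
# detect and lists keywords that must all occur in the same document.
RULES = {
    "purchase_intention": ["want", "buy"],
    "complaint": ["broken"],
}


def compile_rules(rules):
    """Compile declarative rules into a flat list of (opcode, operand) pairs."""
    code = []
    for label, keywords in rules.items():
        code.append(("BEGIN", label))          # start of a rule
        for kw in keywords:
            code.append(("MATCH", kw.lower()))  # keyword must be present
        code.append(("EMIT", label))           # report the rule if all matched
    return code


def run(code, document):
    """Interpret the bytecode against one document; return labels that fired."""
    tokens = set(document.lower().split())  # naive whitespace tokenization
    hits = []
    ok = True
    for op, arg in code:
        if op == "BEGIN":
            ok = True
        elif op == "MATCH":
            ok = ok and arg in tokens
        elif op == "EMIT" and ok:
            hits.append(arg)
    return hits


def mine(code, documents, workers=4):
    """Run the same compiled bytecode over many documents concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda doc: run(code, doc), documents))


if __name__ == "__main__":
    bytecode = compile_rules(RULES)
    docs = ["I want to buy a phone", "the screen arrived broken"]
    for doc, labels in zip(docs, mine(bytecode, docs)):
        print(doc, "->", labels)
```

Compiling the rules once and reusing the bytecode for every document is what makes per-document execution independent, which is the property a concurrent runtime can exploit. The thread pool here only demonstrates that structure and makes no claim about how TML schedules its bytecode.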

Key words: text mining, information extraction, programming language, compiler, virtual machine
