ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2015, Vol. 52 ›› Issue (3): 553-560. doi: 10.7544/issn1000-1239.2015.20131546

• Artificial Intelligence •

TML: A General High-Performance Text Mining Language

Li Jiajing1,2, Li Xiaoming3, Meng Tao2

  1. School of Mechanical Electronic and Information Engineering, China University of Mining & Technology (Beijing), Beijing 100083
  2. Nanjing Wangganzhicha Information Technology Ltd, Nanjing 210014
  3. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871
  (lijiajing@tmlsystem.com)
  • Published: 2015-03-01
  • Online: 2015-03-01
  • Supported by:
    Nanjing 321 Leading Science and Technology Entrepreneurial Talents Program (2013, second batch); Fundamental Research Funds for the Central Universities (2009QJ15); National High Technology Research and Development Program of China (863 Program) (2013AA064303)

Abstract: This paper presents TML (text mining language), a general-purpose programming language for text mining that aims to make complicated text mining tasks easy to express. The implementation comprises a compiler, a runtime virtual machine (interpreter), and a graphical IDE. Most common text mining techniques are built into the language as grammar constructs and reserved words. Users write TML code to specify what to extract and how to extract it; the code is compiled into bytecode, optimized, and then interpreted by the virtual machine, which matches it against the semantic structure of the input text. TML has the following characteristics: 1) It provides a formal way to describe the search scope, target definitions, and mining methods of a text mining job, so users can do declarative text mining in different application domains; 2) The runtime virtual machine, built around information extraction, efficiently implements common text mining techniques and organizes them into a text analysis pipeline; 3) The compiler applies a series of optimizations so that large numbers of matching instructions can execute concurrently, which keeps execution efficient on very large collections of documents and user-written rules. TML has been used in several large-scale online data analysis applications, including commodity purchase intention analysis, fine-grained reputation analysis of brands and products, and legal document analysis.

Key words: text mining, information extraction, programming language, compiler, virtual machine
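
The compile-then-match workflow summarized in the abstract can be illustrated with a minimal Python sketch. This is not TML syntax and not the authors' implementation: the rule format, the names `RULES`, `MatchOp`, `compile_rules`, and `run`, and the use of regular expressions as a stand-in for matching against a semantic structure are all assumptions made for illustration. The thread pool only shows that independent match instructions can be dispatched concurrently; it makes no claim about performance.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical declarative rules (label, pattern); illustrative only, not TML syntax.
RULES = [
    ("purchase_intent", r"\b(?:want|plan) to buy (?P<item>\w+)"),
    ("brand_opinion", r"\b(?P<brand>[A-Z]\w+) is (?P<polarity>great|terrible)"),
]

@dataclass(frozen=True)
class MatchOp:
    """One 'instruction': a labelled, precompiled matcher."""
    label: str
    regex: re.Pattern

def compile_rules(rules):
    """'Compile' step: turn rule source into a flat list of match instructions."""
    return [MatchOp(label, re.compile(pattern)) for label, pattern in rules]

def run(program, text, pool):
    """'VM' step: run every instruction on one document.

    The instructions do not depend on each other, so they are submitted
    to the pool concurrently; map() returns results in program order.
    """
    def execute(op):
        return [(op.label, m.groupdict()) for m in op.regex.finditer(text)]
    return [hit for hits in pool.map(execute, program) for hit in hits]

program = compile_rules(RULES)
docs = ["I want to buy headphones.", "Acme is great, but I plan to buy boots."]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = [run(program, doc, pool) for doc in docs]

for doc, hits in zip(docs, results):
    print(doc, "->", hits)
```

Keeping the rules as data and compiling them once, before any document is read, is what lets the same instruction list be reused across a whole collection.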

CLC Number: