TML: A General High-Performance Text Mining Language

李佳静; 李晓明; 孟涛

doi:10.7544/issn1000-1239.2015.20131546


Abstract

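Summary: We implement a general-purpose, efficient programming language for text mining, including its compiler, runtime virtual machine, and graphical development environment. Users write code in the language to specify extraction targets and extraction methods; the code is then compiled into bytecode and optimized, and the bytecode is matched against the semantic structure of the input text. The language has the following characteristics: 1) it provides a formal method for describing the scope, targets, and methods of text mining, so that declarative text mining can be carried out in different application domains by writing code in the language; 2) the runtime virtual machine, with information extraction technology at its core, efficiently implements a variety of common text mining techniques and organizes them into a text analysis pipeline; 3) a series of compiler optimization techniques allows large numbers of matching instructions to execute fully concurrently, which resolves the execution efficiency problem of the language when handling massive rule sets and massive data. Practical case studies illustrate the descriptive power of TML and its use in real applications.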

Abstract: This paper proposes TML, a general-purpose programming language for text mining. TML stands for “text mining language”, and it aims to make complicated text mining tasks easy. The implementation of TML includes a compiler, a runtime virtual machine (interpreter), and an IDE. TML provides most common text mining techniques, which are exposed as grammar constructs and reserved words. Users write programs in TML, and the code is compiled into bytecode, which is then interpreted by the runtime virtual machine. TML has the following characteristics: 1) It supplies a formal way to model the search scope, object definitions, and mining methods of text mining jobs, so users can easily write declarative text mining programs; 2) The TML runtime virtual machine implements common text mining techniques and organizes them into an efficient text analysis pipeline; 3) The TML compiler fully exploits opportunities to execute its bytecode concurrently, so execution performs well on very large collections of documents and user-written rules. TML has been used in several large-scale online data analysis applications, including commodity purchase intention analysis, fine-grained reputation analysis of brands and products, and legal document analysis.
