ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2014, Vol. 51 ›› Issue (10): 2148-2159. doi: 10.7544/issn1000-1239.2014.20130572

• Artificial Intelligence •



  • Published: 2014-10-01

A Pattern Class Mining Model Based on Active Learning

Guo Husheng1, Wang Wenjian1,2   

  1. (1) School of Computer and Information Technology, Shanxi University, Taiyuan 030006; (2) Key Laboratory of Computational Intelligence and Chinese Information Processing (Shanxi University), Ministry of Education, Taiyuan 030006
  • Online: 2014-10-01

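Abstract (translated from the Chinese): In practical applications, owing to the diversity, fuzziness, and complexity of the objective world, one often encounters data mining problems involving large numbers of samples whose class information is unknown. Traditional methods usually depend on known class information to mine data effectively, and there is currently no effective method for handling multi-class data whose pattern class information is unknown. To address the mining of multi-class samples with unknown class information, a pattern class mining model based on active learning (PM_AL) is proposed. By measuring the relationship between the pattern classes already obtained and the unlabeled samples, the model introduces a sample discrepancy measure to extract the most valuable samples, and through active learning it quickly mines, at a small labeling cost, the possible pattern classes contained in the unlabeled samples. This helps transform a multi-class problem without class labels into a multi-class problem with class labels. Experimental results show that the PM_AL algorithm can handle pattern class mining problems without class information at a small labeling cost.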

Key words (translated from the Chinese): pattern class mining, active learning, PM_AL model, discrepancy, labeling cost

Abstract: In practical applications, the diversity, fuzziness, and complexity of the objective world give rise to many data mining problems in which the class information of the samples is unknown. Traditional methods generally assume that the categories are known before mining, and there is not yet an effective method for multi-class data whose pattern classes are unknown. To address this problem, this paper presents a pattern class mining model based on active learning, named PM_AL. The model measures the difference between the unlabeled samples and the labeled samples and, following the active learning paradigm, selects the most valuable samples. These samples are then labeled by experts, which allows the model to quickly mine the pattern classes implied by the unlabeled samples. In this way, an unlabeled multi-class problem can be transformed into a labeled multi-class problem at very low labeling cost. By applying active learning during the initial class mining, the proposed PM_AL model achieves high learning efficiency, low labeling cost, and good generalization performance. The experimental results demonstrate that the PM_AL model can effectively find as many categories as possible and can handle large-scale multi-class classification problems with unknown categories.

Key words: pattern class mining, active learning, PM_AL model, discrepancy, labeling cost
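
The selection loop outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' PM_AL implementation: the abstract does not define the discrepancy measure, so the code assumes Euclidean distance to the nearest labeled sample as a stand-in, and the names `discrepancy`, `mine_pattern_classes`, `oracle`, and `budget` are hypothetical. The `oracle` callable plays the role of the expert who labels a queried sample.

```python
import math
import random


def discrepancy(x, labeled_points):
    """Assumed measure: Euclidean distance from x to its nearest labeled sample."""
    return min(math.dist(x, p) for p in labeled_points)


def mine_pattern_classes(samples, oracle, budget):
    """Query labels one at a time, always picking the most discrepant sample.

    `budget` is the maximum number of expert queries (assumed to be >= 1).
    Returns (set of discovered classes, list of (index, label) queries).
    """
    # Seed with the first sample: no classes are known yet to compare against.
    labeled = {0: oracle(0)}
    while len(labeled) < min(budget, len(samples)):
        known = [samples[i] for i in labeled]
        unlabeled = [i for i in range(len(samples)) if i not in labeled]
        # The sample least like anything labeled so far is the most likely
        # to belong to a class that has not been seen yet.
        pick = max(unlabeled, key=lambda i: discrepancy(samples[i], known))
        labeled[pick] = oracle(pick)
    return set(labeled.values()), list(labeled.items())


if __name__ == "__main__":
    # Synthetic data: four well-separated clusters with hidden labels.
    rng = random.Random(0)
    centers = {"A": (0.0, 0.0), "B": (10.0, 0.0), "C": (0.0, 10.0), "D": (10.0, 10.0)}
    samples, truth = [], []
    for name, (cx, cy) in centers.items():
        for _ in range(50):
            samples.append((cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1)))
            truth.append(name)
    classes, queries = mine_pattern_classes(samples, lambda i: truth[i], budget=4)
    print(sorted(classes), len(queries))
```

On well-separated data such as this, a farthest-first choice tends to reach a new cluster with each query, which is the intuition behind mining classes at a low labeling cost. Real data with overlapping classes would need a more careful discrepancy measure and stopping rule than this sketch assumes.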