ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2019, Vol. 56 ›› Issue (5): 992-1006.doi: 10.7544/issn1000-1239.2019.20180276

Previous Articles     Next Articles

A Parallel Algorithm for Mining Interactive Features from Large Scale Sequences

Zhao Yuhai, Yin Ying, Li Yuan, Wang Siyao, Wang Guoren   

  1. (School of Computer Science and Engineering, Northeastern University, Shenyang 110819)
  • Online:2019-05-01

Abstract: Sequence is an important type of data which is widely existing in various domains, and thus feature selection from sequence data is of practical significance in extensive applications. Interactive features refer to a set of features, each of which is weakly correlated with the target, but the whole of which is strongly correlated with the target. It is of great challenge to mine interactive features from large scale sequence data for the combinatorial explosion problem of loci. To address the problem, against the background of high-throughput sequencing in biology, a parallel evolutionary algorithm for high-order interactive features mining is proposed in this paper. Instead of sequence-block based parallel strategy, the work is inspired by loci-based idea since the number of loci is the fundamental factor that restricts the efficiency. Further, we propose the conception of maximal allelic common subsequence (MACS) and MACS based strategy for feature region partition. According to the strategy, the search range of interactive features is narrowed to many fragged spaces and interactions are guaranteed not to exist among different fragments. Finally, a parallel ant algorithm based on substitution search is developed to conduct interactive feature selection. Extensive experiments on real and synthetic datasets show that the efficiency and effectiveness of the proposed PACOIFS algorithm is superior to that of competitive algorithms.

Key words: interactive features, data mining, large scale sequence, ant colony algorithm, parallel computation, maximal allelic common subsequence (MACS)

CLC Number: