ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2018, Vol. 55, Issue (8): 1683-1693. doi: 10.7544/issn1000-1239.2018.20180365

Special topic: 2018 Special Issue on Frontier Advances in Data Mining

• Artificial Intelligence •

U-Statistics and Ensemble Learning Based Method for Gene-Gene Interaction Detection

Guo Yingjie1, Liu Xiaoyan1, Wu Chenxi2, Guo Maozu1,3, Li Ao1

  1(School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001); 2(Department of Mathematics, Rutgers University, Piscataway, NJ, USA 08854); 3(Beijing Key Laboratory of Intelligent Processing for Building Big Data (Beijing University of Civil Engineering and Architecture), Beijing 100044) (yjguo0625@gmail.com)
  • Published: 2018-08-01
  • Online: 2018-08-01
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61571163, 61532014, 61671189) and the National Key Research and Development Plan of China (2016YFC0901902).

Abstract: In genome-wide association studies (GWAS) of qualitative traits, most existing methods make strong assumptions about the form of the interaction between a disease trait and single nucleotide polymorphisms (SNPs), which limits their detection power. In recent years, gene-based methods for detecting gene-gene interaction have attracted attention because of their advantages in both statistical power and biological interpretability. To address the limited range of interaction types that existing methods can detect, we propose GBUtrees, a hypothesis testing framework for gene-based gene-gene interaction detection built on U statistics and tree-based ensemble learners. For each base learner, we construct a statistic that measures how far the predicted log odds ratio of a qualitative trait deviates from an additive structure in the two genes. Averaging this statistic over learners trained on different subsamples gives it the form of a U statistic, so the asymptotic normality of U statistics supplies the distribution of the constructed statistic. GBUtrees thus benefits from both the non-linear modeling power of tree-based ensemble models and the asymptotic normality of U statistics. The method makes no assumption about the form of interaction, which strengthens its power to detect different kinds of interactions. Simulation studies on eight disease models and experiments on real data from the RA pathway in the WTCCC dataset indicate that the method is effective in detecting different kinds of interactions and can be applied to genome-wide association studies.

Key words: U statistics, ensemble learning, gene-gene interaction, single nucleotide polymorphism, genome-wide association studies
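
The idea in the abstract of averaging an additivity-deviation statistic over subsamples can be illustrated with a minimal sketch. This is not the authors' GBUtrees implementation. It rests on several assumptions made only for illustration: each gene is reduced to a hypothetical binary code (`g1`, `g2`), a smoothed 2x2 table of cell log odds stands in for the tree-based base learner, and all function names and simulation parameters are invented here. The variance estimate that the actual test would need from U-statistic theory is deliberately left out.

```python
import numpy as np


def cell_log_odds(g1, g2, y):
    """Stand-in base learner: smoothed log odds of y == 1 in each (g1, g2) cell."""
    f = np.empty((2, 2))
    for a in (0, 1):
        for b in (0, 1):
            in_cell = (g1 == a) & (g2 == b)
            cases = y[in_cell].sum()
            controls = in_cell.sum() - cases
            # Adding 0.5 to both counts keeps the log finite for empty cells.
            f[a, b] = np.log((cases + 0.5) / (controls + 0.5))
    return f


def interaction_contrast(f):
    """Deviation from additivity: zero when f[a, b] = c + u[a] + v[b]."""
    return f[1, 1] - f[1, 0] - f[0, 1] + f[0, 0]


def subsample_averaged_statistic(g1, g2, y, k, n_subsamples, rng):
    """Average the contrast over random size-k subsamples drawn without
    replacement (an incomplete U-statistic with a size-k kernel)."""
    n = len(y)
    values = []
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=k, replace=False)
        f = cell_log_odds(g1[idx], g2[idx], y[idx])
        values.append(interaction_contrast(f))
    return float(np.mean(values))


def simulate(n, beta_interaction, rng):
    """Hypothetical logistic disease model with an optional interaction term."""
    g1 = rng.integers(0, 2, size=n)
    g2 = rng.integers(0, 2, size=n)
    logit = -1.0 + 0.8 * g1 + 0.8 * g2 + beta_interaction * g1 * g2
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
    return g1, g2, y


rng = np.random.default_rng(0)
for beta in (0.0, 2.0):
    g1, g2, y = simulate(4000, beta, rng)
    t = subsample_averaged_statistic(g1, g2, y, k=1000, n_subsamples=100, rng=rng)
    print(f"interaction coefficient {beta}: averaged contrast {t:.3f}")
```

In this toy setting the averaged contrast should sit near zero for the additive model and near the simulated interaction coefficient otherwise. Turning it into a test would require standardizing it with a variance estimate before comparing against a normal reference distribution.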

CLC Number: