Cheng Xiaotian, Ding Weiping, Geng Yu, Huang Jiashuang, Ju Hengrong, Guo Jing. Transformer Interpretation Method Based on Sequential Three-Way Mask and Attention Fusion[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202440382

Transformer Interpretation Method Based on Sequential Three-Way Mask and Attention Fusion

Funds: This work was supported by the National Natural Science Foundation of China (61976120, 62006128, 62102199), the Natural Science Foundation of Jiangsu Province (BK20231337), the Double-Creation Doctoral Program of Jiangsu Province, the Natural Science Key Foundation of Higher Education of Jiangsu Province (21KJA510004), the China Postdoctoral Science Foundation (2022M711716), and the Postgraduate Research & Practice Innovation Program of Jiangsu Province (SJCX24_2021).
More Information
  • Author Bio:

    Cheng Xiaotian: born in 2001. Master candidate. His main research interests include granular computing, deep learning, and computer vision

    Ding Weiping: born in 1979. PhD, professor, PhD supervisor, and senior member of CCF. His main research interests include data mining, machine learning, granular computing, evolutionary computing, and big data analytics

    Geng Yu: born in 1998. Master candidate. His main research interests include granular computing, machine learning, and deep learning

    Huang Jiashuang: born in 1988. PhD, associate professor. His main research interests include brain network analysis and deep learning

    Ju Hengrong: born in 1989. PhD, associate professor. His main research interests include granular computing, rough sets, machine learning, and knowledge discovery

    Guo Jing: born in 2000. Master candidate. Her main research interests include granular computing, machine learning, and deep learning

  • Received Date: May 30, 2024
  • Revised Date: March 09, 2025
  • Accepted Date: April 03, 2025
  • Available Online: April 02, 2025
  • Transformers have gradually become the preferred architecture for computer vision tasks, which has spurred the development of interpretability methods for them. Traditional interpretation methods mostly build an interpretable map from a perturbation mask generated at the final layer of the Transformer encoder. However, these methods ignore the uncertain information in the mask and the information lost during upsampling and downsampling, which can result in coarse and incomplete localization of the object region. To overcome these problems, a Transformer interpretation method based on a sequential three-way mask and attention fusion (SAF-Explainer) is proposed. SAF-Explainer mainly comprises a sequential three-way mask (S3WM) module and an attention fusion (AF) module. The S3WM module applies strict threshold conditions to the mask, preventing its uncertain information from degrading the interpretation results and thereby locating the object effectively. Subsequently, the AF module aggregates attention matrices into a relationship matrix for cross-layer information interaction, which refines the detailed information in the interpretation and produces clear, complete results. To verify the effectiveness of SAF-Explainer, comparative experiments were conducted on three natural image datasets and one medical image dataset; the results show that SAF-Explainer offers better interpretability. This work advances visual explanation techniques by providing more accurate and clinically relevant interpretability for Transformer-based vision systems, particularly in medical diagnostic applications where precise region identification is crucial.
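    The abstract does not spell out the exact thresholding rule inside S3WM, but the general sequential three-way idea it invokes can be illustrated. The following Python sketch is an illustrative assumption, not the paper's implementation: the function name, the threshold values, and the final 0.5 cut are all hypothetical. Each round accepts high-relevance patches into the positive region, rejects low-relevance ones into the negative region, and defers the uncertain boundary region to a later round with looser thresholds.

        import torch

        def sequential_three_way_mask(scores: torch.Tensor,
                                      alphas=(0.8, 0.6),
                                      betas=(0.2, 0.4)) -> torch.Tensor:
            # scores: per-patch relevance in [0, 1], shape (num_patches,).
            # alphas/betas: acceptance/rejection thresholds per round
            # (illustrative values), loosening toward each other.
            mask = torch.zeros_like(scores)
            undecided = torch.ones_like(scores, dtype=torch.bool)
            for alpha, beta in zip(alphas, betas):
                accept = undecided & (scores >= alpha)
                reject = undecided & (scores <= beta)
                mask[accept] = 1.0   # positive region: keep patch
                mask[reject] = 0.0   # negative region: mask patch out
                undecided &= ~(accept | reject)
            # Final two-way cut resolves patches still in the boundary region.
            mask[undecided] = (scores[undecided] >= 0.5).float()
            return mask

    For a ViT-style input split into a 14×14 patch grid, sequential_three_way_mask(torch.rand(196)) would return a hard 0/1 mask over the 196 patches, with no patch left in an uncertain state.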
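    The AF module's aggregation is likewise described only at a high level. A common way to build a cross-layer relationship matrix from per-layer attention is rollout-style aggregation: average the heads, add an identity term for the residual connections, renormalize, and compose layers by matrix multiplication. The sketch below assumes that scheme and should not be read as SAF-Explainer's exact procedure.

        import torch

        def fuse_attention(attn_per_layer):
            # attn_per_layer: list of (num_heads, tokens, tokens) attention
            # matrices, one per Transformer layer (e.g. 12 for ViT-Base).
            tokens = attn_per_layer[0].shape[-1]
            relation = torch.eye(tokens)
            for attn in attn_per_layer:
                a = attn.mean(dim=0)                  # average over heads
                a = a + torch.eye(tokens)             # residual connection
                a = a / a.sum(dim=-1, keepdim=True)   # keep rows normalized
                relation = a @ relation               # compose across layers
            return relation

    Row 0 of the returned matrix gives a CLS-to-patch relevance map that can be reshaped to the patch grid and upsampled to image size.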
