Cheng Xiaotian, Ding Weiping, Geng Yu, Huang Jiashuang, Ju Hengrong, Guo Jing. Transformer Interpretation Method Based on Sequential Three-Way Mask and Attention Fusion[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202440382

Transformer Interpretation Method Based on Sequential Three-Way Mask and Attention Fusion

Funds: This work was supported by the National Natural Science Foundation of China (61976120, 62006128, 62102199), the Natural Science Foundation of Jiangsu Province (BK20231337), the Double-Creation Doctoral Program of Jiangsu Province, the Natural Science Key Foundation of Higher Education of Jiangsu Province (21KJA510004), the China Postdoctoral Science Foundation (2022M711716), and the Postgraduate Research & Practice Innovation Program of Jiangsu Province (SJCX24_2021).
More Information
  • Author Bio:

    Cheng Xiaotian: born in 2001. Master candidate. His main research interests include granular computing, deep learning, and computer vision

    Ding Weiping: born in 1979. PhD, professor, PhD supervisor, and senior member of CCF. His main research interests include data mining, machine learning, granular computing, evolutionary computing, and big data analytics

    Geng Yu: born in 1998. Master candidate. His main research interests include granular computing, machine learning, and deep learning

    Huang Jiashuang: born in 1988. PhD, associate professor. His main research interests include brain network analysis and deep learning

    Ju Hengrong: born in 1989. PhD, associate professor. His main research interests include granular computing, rough sets, machine learning, and knowledge discovery

    Guo Jing: born in 2000. Master candidate. Her main research interests include granular computing, machine learning, and deep learning

  • Received Date: May 30, 2024
  • Revised Date: March 09, 2025
  • Accepted Date: April 03, 2025
  • Available Online: April 02, 2025
  • Transformers have gradually become the preferred architecture for computer vision tasks, which has spurred the development of interpretability methods for them. Traditional interpretation methods mostly build an interpretable map from a perturbation mask generated at the final layer of the Transformer encoder. However, these methods ignore the uncertain information in the mask and the information lost during upsampling and downsampling, which can result in coarse and incomplete localization of the object region. To overcome these problems, a Transformer interpretation method based on a sequential three-way mask and attention fusion (SAF-Explainer) is proposed. SAF-Explainer mainly comprises a sequential three-way mask (S3WM) module and an attention fusion (AF) module. The S3WM module applies strict threshold conditions to the mask, preventing its uncertain information from degrading the interpretation results and thereby locating the object effectively. Subsequently, the AF module aggregates attention matrices into a relationship matrix for cross-layer information interaction, which refines the detailed information in the interpretation and produces clear, complete results. To verify the effectiveness of SAF-Explainer, comparative experiments were conducted on three natural image datasets and one medical image dataset; the results show that SAF-Explainer offers better interpretability. This work advances visual explanation techniques by providing more accurate and clinically relevant interpretability for Transformer-based vision systems, particularly in medical diagnostic applications where precise region identification is crucial.
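    The abstract does not spell out the exact thresholding rule inside S3WM, but the general sequential three-way idea it invokes can be illustrated. The following Python sketch is an illustrative assumption, not the paper's implementation: the function name, the threshold values, and the final 0.5 cut are all hypothetical. Each round accepts high-relevance patches into the positive region, rejects low-relevance ones into the negative region, and defers the uncertain boundary region to a later round with looser thresholds.

        import torch

        def sequential_three_way_mask(scores: torch.Tensor,
                                      alphas=(0.8, 0.6),
                                      betas=(0.2, 0.4)) -> torch.Tensor:
            # scores: per-patch relevance in [0, 1], shape (num_patches,).
            # alphas/betas: acceptance/rejection thresholds per round
            # (illustrative values), loosening toward each other.
            mask = torch.zeros_like(scores)
            undecided = torch.ones_like(scores, dtype=torch.bool)
            for alpha, beta in zip(alphas, betas):
                accept = undecided & (scores >= alpha)
                reject = undecided & (scores <= beta)
                mask[accept] = 1.0   # positive region: keep patch
                mask[reject] = 0.0   # negative region: mask patch out
                undecided &= ~(accept | reject)
            # Final two-way cut resolves patches still in the boundary region.
            mask[undecided] = (scores[undecided] >= 0.5).float()
            return mask

    For a ViT-style input split into a 14×14 patch grid, sequential_three_way_mask(torch.rand(196)) would return a hard 0/1 mask over the 196 patches, with no patch left in an uncertain state.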
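    The AF module's aggregation is likewise described only at a high level. A common way to build a cross-layer relationship matrix from per-layer attention is rollout-style aggregation: average the heads, add an identity term for the residual connections, renormalize, and compose layers by matrix multiplication. The sketch below assumes that scheme and should not be read as SAF-Explainer's exact procedure.

        import torch

        def fuse_attention(attn_per_layer):
            # attn_per_layer: list of (num_heads, tokens, tokens) attention
            # matrices, one per Transformer layer (e.g. 12 for ViT-Base).
            tokens = attn_per_layer[0].shape[-1]
            relation = torch.eye(tokens)
            for attn in attn_per_layer:
                a = attn.mean(dim=0)                  # average over heads
                a = a + torch.eye(tokens)             # residual connection
                a = a / a.sum(dim=-1, keepdim=True)   # keep rows normalized
                relation = a @ relation               # compose across layers
            return relation

    Row 0 of the returned matrix gives a CLS-to-patch relevance map that can be reshaped to the patch grid and upsampled to image size.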
