
    Transformer Interpretation Method Based on Sequential Three-Way Mask and Attention Fusion

    Abstract: Transformers have gradually become the preferred solution for computer vision tasks, which has driven the development of interpretability methods for them. Most existing interpretation methods build the explanation map from a perturbation mask generated at the final layer of the Transformer encoder. They ignore both the uncertain information in the mask and the information lost during upsampling and downsampling, so the object region is localized coarsely and incompletely. To overcome these problems, a Transformer interpretation method based on sequential three-way mask and attention fusion (SAF-Explainer) is proposed. SAF-Explainer consists mainly of a sequential three-way mask (S3WM) module and an attention fusion (AF) module. The S3WM module processes the mask under strict threshold conditions, preventing uncertain information in the mask from degrading the explanation and thereby localizing the object effectively. The AF module then aggregates attention matrices into a relation matrix that captures cross-layer information interaction, which is used to refine the details of the explanation and yields complete results with clear edges. To verify the effectiveness of SAF-Explainer, comparative experiments were conducted on three natural image datasets and one medical image dataset. The results show that SAF-Explainer achieves better interpretability. This work advances visual explanation techniques by providing more accurate and clinically relevant interpretability for Transformer-based vision systems, particularly in medical diagnostic applications where precise region identification is crucial.
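The two modules named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no formulas, so everything here is an assumption. The function names `sequential_three_way_mask` and `fuse_attention`, the `(alpha, beta)` threshold pairs, and the choice to zero out entries that stay uncertain are all hypothetical. The fusion step substitutes a generic rollout-style product of per-layer attention matrices for the paper's AF module, whose exact aggregation rule is not stated.

```python
import numpy as np


def sequential_three_way_mask(mask, thresholds):
    """Keep only mask entries accepted under a sequence of (alpha, beta) pairs.

    At each stage an undecided entry is accepted if mask >= alpha, rejected if
    mask <= beta, and otherwise deferred to the next stage. Entries that are
    still undecided after the last stage are treated as uncertain and set to 0
    (an assumption made for this sketch).
    """
    mask = np.asarray(mask, dtype=float)
    decided = np.zeros(mask.shape, dtype=bool)
    result = np.zeros(mask.shape, dtype=float)
    for alpha, beta in thresholds:
        if not beta < alpha:
            raise ValueError("each pair must satisfy beta < alpha")
        accept = (mask >= alpha) & ~decided
        reject = (mask <= beta) & ~decided
        result[accept] = mask[accept]  # accepted entries keep their value
        decided |= accept | reject     # rejected entries stay at 0
    return result


def fuse_attention(attn_layers):
    """Aggregate per-layer attention into one token-to-token relation matrix.

    Each element of attn_layers has shape (heads, tokens, tokens). Heads are
    averaged, the identity is added to account for the residual connection,
    rows are renormalised, and the layers are multiplied together
    (rollout-style aggregation, used here as a stand-in).
    """
    n = attn_layers[0].shape[-1]
    relation = np.eye(n)
    for attn in attn_layers:
        a = attn.mean(axis=0) + np.eye(n)
        a = a / a.sum(axis=-1, keepdims=True)
        relation = a @ relation
    return relation


# Illustrative use on a flattened 4-patch mask with random attention weights.
patch_mask = np.array([0.9, 0.7, 0.5, 0.1])
kept = sequential_three_way_mask(patch_mask, [(0.8, 0.2), (0.6, 0.4)])
rng = np.random.default_rng(0)
relation = fuse_attention([rng.random((2, 4, 4)) for _ in range(3)])
refined = relation @ kept  # spread the retained evidence across related patches
```

In this sketch the thresholds tighten the boundary region stage by stage, so a value such as 0.7 is deferred under `(0.8, 0.2)` and accepted under `(0.6, 0.4)`, while 0.5 is never decided and is dropped. Multiplying the retained mask by the relation matrix is one plausible way to let cross-layer attention refine the map at patch level; the paper may combine the two modules differently.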

       
