Abstract:
Transformers have gradually become the preferred solution for computer vision tasks, which has spurred the development of interpretability methods for them. Existing interpretation methods mostly use a perturbation mask generated from the final layer of the Transformer encoder to produce an explanation map. However, these methods ignore the uncertain information in the mask and the information lost during upsampling and downsampling, which can result in coarse and incomplete localization of the object region. To address these problems, a Transformer explanation method based on a sequential three-way mask and attention fusion (SAF-Explainer) is proposed. SAF-Explainer mainly comprises a sequential three-way mask (S3WM) module and an attention fusion (AF) module. The S3WM module processes the mask under strict threshold conditions to prevent uncertain information in the mask from degrading the explanation, thereby effectively localizing the object. The AF module then aggregates attention matrices into a relationship matrix for cross-layer information interaction, which refines the details of the explanation and yields clear, complete explanation results. To verify the effectiveness of SAF-Explainer, comparative experiments were conducted on three natural-image datasets and one medical-image dataset; the results show that SAF-Explainer offers better explainability than existing methods. This work advances visual explanation techniques by providing more accurate and clinically relevant interpretability for Transformer-based vision systems, particularly in medical diagnostic applications where precise region identification is crucial.
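The abstract does not specify the exact operations of the S3WM and AF modules, so the following is only a minimal illustrative sketch of the two underlying ideas it names: splitting a relevance mask into accept/boundary/reject regions with two thresholds (a generic three-way decision, not the paper's sequential procedure), and aggregating per-layer attention matrices into a single cross-layer relationship matrix (a rollout-style product, which may differ from the paper's AF rule). All function names, thresholds, and shapes are assumptions for illustration.

```python
# Illustrative sketch only; not the authors' implementation.
import numpy as np

def three_way_split(mask, alpha=0.6, beta=0.3):
    """Partition a normalized relevance mask into positive, boundary, and negative regions."""
    positive = mask >= alpha           # confidently relevant positions
    negative = mask <= beta            # confidently irrelevant positions
    boundary = ~positive & ~negative   # uncertain region, deferred to later (sequential) stages
    return positive, boundary, negative

def aggregate_attention(attn_per_layer):
    """Fuse per-layer attention maps (each [heads, tokens, tokens]) into one
    cross-layer relationship matrix via residual-aware matrix products."""
    n = attn_per_layer[0].shape[-1]
    joint = np.eye(n)
    for attn in attn_per_layer:
        a = attn.mean(axis=0)                    # average over heads
        a = a + np.eye(n)                        # account for residual connections
        a = a / a.sum(axis=-1, keepdims=True)    # renormalize rows
        joint = a @ joint                        # propagate relations across layers
    return joint

# Toy usage with random attention maps (12 layers, 3 heads, 197 tokens, e.g. ViT-B/16).
layers = [np.random.rand(3, 197, 197) for _ in range(12)]
relation = aggregate_attention(layers)
cls_to_patches = relation[0, 1:]                 # CLS-token relevance to image patches
pos, bnd, neg = three_way_split(cls_to_patches / cls_to_patches.max())
```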