Meena Kowshalya
57193347708
Publications - 1
A vision explainability method for image captioning using transformer decoder attention maps
Publication Name: MethodsX
Publication Date: 2025-12-01
Volume: 15
Issue: Unknown
Page Range: Unknown
Description:
Image captioning is a crucial task that enables systems to generate descriptive sentences for visual content. Although image captioning systems sit at the intersection of Computer Vision and Natural Language Processing, these models act mostly as black boxes, offering little or no insight into how captions are derived. We present a novel explainable image captioning framework that integrates a Convolutional Neural Network encoder with a Transformer decoder. Attention-based heatmaps are used to explain which image regions inform the caption, offering transparency in the decision-making process. Captioning quality and interpretability are evaluated on the MS COCO dataset using BLEU, METEOR, CIDEr and SPICE. The method enhances trustworthiness and transparency, making it more reliable for applications such as healthcare, education, security, surveillance and forecasting. The method provides a reproducible way to integrate visual explainability into image captioning by exploring transformer decoder attention maps. It contributes to the growing body of eXplainable AI (XAI) by addressing the transparency gap in vision-language models. It balances performance with interpretability, paving the way for more transparent and trustworthy AI systems.
Open Access: Yes