Abstract
Generating informative and knowledge-rich image captions remains a challenge for many existing captioning models, which often produce generic descriptions lacking specificity and contextual depth. To address this limitation, we propose KRCapVLM, a knowledge replay-based image captioning framework built upon a vision-language model. Our approach enhances the base model by incorporating beam search decoding to encourage explicit knowledge expression, integrating attention-based modules into the image encoder to improve visual feature representation, and employing training schedulers to improve optimization reliability and consistency of downstream performance. These components jointly lead to substantial improvements in both caption quality and knowledge recognition. On the KnowCap dataset, recognition accuracy improves from 50.40% to 63.30%, accompanied by a modest increase in CIDEr (+2.3), indicating enhanced factual grounding without sacrificing caption fluency. Moreover, KRCapVLM demonstrates strong generalization to previously unseen knowledge categories, producing more informative and contextually grounded captions when explicit real-world concepts are present. Overall, our results highlight the effectiveness of KRCapVLM in advancing knowledge-aware image captioning while maintaining robust performance on generic captioning benchmarks.
| Original language | English |
|---|---|
| Article number | 113568 |
| Journal | Pattern Recognition |
| Volume | 179 |
| State | Published - Nov 2026 |
Bibliographical note
Publisher Copyright: © 2026
Keywords
- Computer vision
- Image captioning
- Knowledge recognition
- Pattern recognition
- Vision-language models
ASJC Scopus subject areas
- Software
- Signal Processing
- Computer Vision and Pattern Recognition
- Artificial Intelligence
Fingerprint
Dive into the research topics of 'KRCapVLM: Beam-guided knowledge replay for knowledge-rich image captioning using vision-language model'. Together they form a unique fingerprint.