
AN IMAGE CAPTIONING MODEL INTEGRATING KNOWLEDGE GRAPHS AND DEEP LEARNING

About this article

Received: 17/04/2025; Revised: 16/06/2025; Published: 27/06/2025

Authors

1. Nguyen Do Thai Nguyen, Ho Chi Minh City University of Education
2. Nguyen Van Tuan, Ho Chi Minh City University of Education
3. Nguyen Ngoc Phu Ty, Ho Chi Minh City University of Education
4. Nguyen Huu Minh Quan, Ho Chi Minh City University of Education

Abstract


This study proposes a novel image captioning model that integrates knowledge graphs and deep learning to enhance semantic understanding and generate more accurate image descriptions. The research addresses the limitations of conventional captioning approaches, which often overlook the relationships between entities within an image. Our method first generates scene graphs from input images and then enriches them with external knowledge from structured knowledge graphs before decoding semantically rich captions. The model is trained and evaluated on standard datasets, including MSCOCO and Visual Genome. Experimental results demonstrate that the proposed model outperforms existing baselines, achieving BLEU and METEOR scores of 41.3 and 31.6, respectively, with particularly strong gains in complex scenes containing multiple entities. Furthermore, knowledge graph augmentation significantly improves the contextual relevance and informativeness of the generated captions. This research contributes to advancing multi-object image captioning and highlights the potential of combining symbolic knowledge with deep learning models for comprehensive scene understanding.
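
As a rough illustration of the pipeline described above, the Python sketch below wires the three stages together: scene graph generation, knowledge-graph enrichment, and caption decoding. Every name in it (the Triple dataclass, the placeholder functions, and the toy facts) is a hypothetical stand-in rather than the authors' implementation; a real system would use a learned scene graph generator, an external structured knowledge base, and a trained caption decoder.

# Illustrative sketch of the described pipeline; all names are placeholders.
from dataclasses import dataclass
from typing import List


@dataclass
class Triple:
    # A (subject, predicate, object) relation, as used in scene/knowledge graphs.
    subj: str
    pred: str
    obj: str


def generate_scene_graph(image_path: str) -> List[Triple]:
    # Stand-in for a learned scene graph generator: detect objects in the
    # image and predict pairwise relations between them.
    return [Triple("man", "riding", "horse"), Triple("horse", "on", "beach")]


def enrich_with_knowledge_graph(triples: List[Triple]) -> List[Triple]:
    # Stand-in for querying an external knowledge graph: add background
    # facts about the entities already present in the scene graph.
    external_facts = {"horse": Triple("horse", "is_a", "animal")}
    extra = [external_facts[t.subj] for t in triples if t.subj in external_facts]
    return triples + extra


def decode_caption(triples: List[Triple]) -> str:
    # Stand-in for a trained decoder; here a naive linearisation of triples.
    return "; ".join(f"{t.subj} {t.pred} {t.obj}" for t in triples)


if __name__ == "__main__":
    graph = generate_scene_graph("example.jpg")
    enriched = enrich_with_knowledge_graph(graph)
    print(decode_caption(enriched))

The point of the sketch is only the data flow: relations extracted from the image are merged with facts retrieved from an external knowledge graph before caption generation.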

Keywords


Scene graph; Knowledge graph; Image captioning; Deep learning; Scene graph generation





DOI: https://doi.org/10.34238/tnu-jst.12614
