
AN IMAGE CAPTIONING MODEL INTEGRATING KNOWLEDGE GRAPHS AND DEEP LEARNING

About this article

Received: 17/04/2025; Revised: 16/06/2025; Published: 27/06/2025

Authors

1. Nguyen Do Thai Nguyen, Ho Chi Minh City University of Education
2. Nguyen Van Tuan, Ho Chi Minh City University of Education
3. Nguyen Ngoc Phu Ty, Ho Chi Minh City University of Education
4. Nguyen Huu Minh Quan, Ho Chi Minh City University of Education

Abstract


This study proposes a novel image captioning model that integrates knowledge graphs and deep learning to enhance semantic understanding and generate more accurate image descriptions. The research addresses the limitations of conventional captioning approaches, which often overlook the relationships between entities within an image. Our method first generates scene graphs from input images and then enriches them with external knowledge from structured knowledge graphs before decoding semantically rich captions. The model is trained and evaluated on standard datasets, including MSCOCO and Visual Genome. Experimental results demonstrate that the proposed model outperforms existing baselines, achieving BLEU and METEOR scores of 41.3 and 31.6, respectively, with particularly strong gains in complex scenes containing multiple entities. Furthermore, knowledge graph augmentation significantly improves the contextual relevance and informativeness of the generated captions. This research contributes to advancing multi-object image captioning and highlights the potential of combining symbolic knowledge with deep learning models for comprehensive scene understanding.
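
As a rough illustration of the pipeline described above, the Python sketch below wires the three stages together: scene graph generation, knowledge-graph enrichment, and caption decoding. Every name in it (the Triple dataclass, the placeholder functions, and the toy facts) is a hypothetical stand-in rather than the authors' implementation; a real system would use a learned scene graph generator, an external structured knowledge base, and a trained caption decoder.

# Illustrative sketch of the described pipeline; all names are placeholders.
from dataclasses import dataclass
from typing import List


@dataclass
class Triple:
    # A (subject, predicate, object) relation, as used in scene/knowledge graphs.
    subj: str
    pred: str
    obj: str


def generate_scene_graph(image_path: str) -> List[Triple]:
    # Stand-in for a learned scene graph generator: detect objects in the
    # image and predict pairwise relations between them.
    return [Triple("man", "riding", "horse"), Triple("horse", "on", "beach")]


def enrich_with_knowledge_graph(triples: List[Triple]) -> List[Triple]:
    # Stand-in for querying an external knowledge graph: add background
    # facts about the entities already present in the scene graph.
    external_facts = {"horse": Triple("horse", "is_a", "animal")}
    extra = [external_facts[t.subj] for t in triples if t.subj in external_facts]
    return triples + extra


def decode_caption(triples: List[Triple]) -> str:
    # Stand-in for a trained decoder; here a naive linearisation of triples.
    return "; ".join(f"{t.subj} {t.pred} {t.obj}" for t in triples)


if __name__ == "__main__":
    graph = generate_scene_graph("example.jpg")
    enriched = enrich_with_knowledge_graph(graph)
    print(decode_caption(enriched))

The point of the sketch is only the data flow: relations extracted from the image are merged with facts retrieved from an external knowledge graph before caption generation.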

Keywords


Scene graph; Knowledge graph; Image captioning; Deep learning; Scene graph generation





DOI: https://doi.org/10.34238/tnu-jst.12614
