Evaluation of a Semantic Representation-Based Retrieval Model on a Text Dataset Generated from Image Transformation

Muhammad Firmansyah(1*), Dhendra Marutho(2), Ahmad Ilham(3), Irwansyah Saputra(4)


(1) Computer Science, Faculty of Information Technology, Universitas Nusa Mandiri, Jakarta, Indonesia
(2) Informatics, Universitas Muhammadiyah Semarang
(3) Informatics, Universitas Muhammadiyah Semarang
(4) Computer Science, Faculty of Information Technology, Universitas Nusa
(*) Corresponding Author

Abstract


Growing demand for efficient multimodal information retrieval has driven research into bridging visual and textual data. Models such as CLIP offer state-of-the-art semantic alignment, but their substantial computational requirements make deployment difficult in resource-constrained environments. This study introduces a lightweight retrieval framework that uses the BLIP image captioning model to transform images into textual descriptions, reframing cross-modal retrieval as a text-to-text task. We systematically evaluated three retrieval models (BM25, SBERT, and T5) on caption-transformed versions of the MSCOCO and Flickr30K datasets, using both classical metrics (Recall@5, mAP) and semantic-aware metrics (SAR@5, Semantic mAP). T5 achieves the best semantic performance (SAR@5 = 0.561, Semantic mAP = 0.524), ahead of SBERT (SAR@5 = 0.524) and well ahead of the lexical BM25 baseline (SAR@5 = 0.312). The proposed BLIP+T5 pipeline attains 88% of CLIP’s semantic accuracy while reducing inference latency by approximately 60% and GPU memory consumption by over 60%. These findings indicate that caption-based retrieval frameworks can serve as scalable, cost-effective alternatives to computationally intensive multimodal systems, especially in latency-sensitive and resource-limited scenarios. Future work will explore fine-tuning strategies, domain-adapted semantic metrics, and robustness under real-world conditions.
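
The text-to-text reframing described above can be illustrated with a minimal sketch of the lexical baseline and the two classical metrics. It is not the authors' implementation: the captions are hypothetical stand-ins for BLIP output, the BM25 parameters k1 = 1.5 and b = 0.75 are common defaults assumed here, and the helper names (BM25, recall_at_k, average_precision) are invented for illustration. SAR@5 and Semantic mAP are not defined in the abstract, so they are left out.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplistic whitespace tokenizer; an assumption made for this sketch.
    return text.lower().split()


class BM25:
    """Small Okapi BM25 index over a list of caption strings."""

    def __init__(self, docs, k1=1.5, b=0.75):
        # k1 and b are commonly used defaults, not values from the paper.
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tf = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(t for d in self.docs for t in set(d))
        # Smoothed IDF that stays non-negative for frequent terms.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tf[i], len(self.docs[i])
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def rank(self, query, k=5):
        # Sort by descending score; ties are broken by document index.
        scored = sorted(range(len(self.docs)), key=lambda i: (-self.score(query, i), i))
        return scored[:k]


def recall_at_k(ranked, relevant, k=5):
    # Fraction of the relevant items that appear in the top k results.
    relevant = set(relevant)
    return len(set(ranked[:k]) & relevant) / len(relevant)


def average_precision(ranked, relevant):
    # Mean of precision values at each rank where a relevant item occurs.
    relevant = set(relevant)
    hits, total = 0, 0.0
    for pos, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / pos
    return total / len(relevant) if relevant else 0.0


# Hypothetical captions standing in for BLIP-generated descriptions.
captions = [
    "a dog running on the beach",
    "a man riding a bicycle down a street",
    "two cats sleeping on a sofa",
    "a brown dog playing with a ball in the grass",
]
index = BM25(captions)
ranked = index.rank("dog playing on the beach", k=4)
print(ranked, recall_at_k(ranked, {0, 3}, k=2), average_precision(ranked, {0, 3}))
```

mAP is the mean of average_precision over all queries. A dense retriever such as SBERT would replace BM25.score with a similarity between caption and query embeddings while leaving the metric functions unchanged.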


Keywords


Information Retrieval; Semantic Embedding; BLIP; SBERT; Zero-shot; Text Retrieval


DOI: https://doi.org/10.26714/jichi.v6i2.19240

____________________________________________________________________________
Journal of Intelligent Computing and Health Informatics (JICHI)
ISSN 2715-6923 (print) | 2721-9186 (online)
Organized by
Department of Informatics
Faculty of Engineering
Universitas Muhammadiyah Semarang

W : https://jurnal.unimus.ac.id/index.php/ICHI

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.