Multimodal RAG using text and visual data

M.H. Shevchenko, M.V. Androshchuk

Abstract


This paper presents the development and investigation of a multimodal Retrieval-Augmented Generation system designed for the analysis and interpretation of medical images. The research focuses on chest X-ray images and their corresponding radiology reports. The primary goal was to create a system capable of performing two key tasks: generating a detailed radiology report for an input image and providing accurate answers to specific questions about it. A secondary goal was to demonstrate that employing a multimodal retrieval-augmented approach significantly improves generation quality compared to using large multimodal models without a retrieval component. The system's implementation utilizes a combination of state-of-the-art deep learning models. The BiomedCLIP model, fine-tuned on the target dataset, was used to generate vector embeddings for both text and visual data. The generator component is based on the large language model LLaVA-Med 1.5, which is adapted for the medical domain and quantized to operate under limited computational resources. The system architecture also includes auxiliary classifiers based on DenseNet121 to determine the image projection and identify clinical findings, thereby enhancing retrieval accuracy. The experimental evaluation involved testing six different configurations of the developed system. The evaluation was conducted using a range of metrics, including accuracy and F1-score for the question-answering task, as well as BLEU, ROUGE, F1-CheXbert, and F1-RadGraph for assessing the quality of the generated reports. The test results demonstrated a significant advantage of all system configurations over the baseline generator model. The best results were achieved by the configuration that utilizes projection and clinical finding classifiers with an exact match requirement for the identified pathologies.
The study confirmed that integrating a relevant data retrieval mechanism significantly enhances both the structural and semantic quality of the generated textual descriptions for medical images.
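The retrieval step described above can be illustrated with a minimal sketch: given a query-image embedding and a matrix of precomputed report embeddings (e.g., produced by a fine-tuned BiomedCLIP encoder), the top-k most similar reports are selected by cosine similarity and prepended to the generator prompt. The function names and prompt wording here are illustrative assumptions, not the paper's actual implementation; only NumPy is used so the sketch is self-contained.

```python
import numpy as np

def retrieve_top_k(query_emb, corpus_embs, k=3):
    """Return indices of the k nearest reports by cosine similarity.

    query_emb  : (d,) embedding of the input image.
    corpus_embs: (n, d) matrix of stored report/image embeddings.
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q                       # cosine similarity per corpus entry
    return np.argsort(-sims)[:k]       # indices, most similar first

def build_prompt(question, retrieved_reports):
    """Assemble a retrieval-augmented prompt for the generator (illustrative format)."""
    context = "\n".join(f"- {r}" for r in retrieved_reports)
    return (f"Relevant prior reports:\n{context}\n\n"
            f"Question about the current image: {question}")
```

In the full system, retrieval would additionally be filtered by the DenseNet121 projection and clinical-finding classifiers before similarity ranking, which is what the best-performing configuration does.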

Problems in programming 2025; 3: 66-78


Keywords


Retrieval-Augmented Generation; multimodality; medical imaging; report generation; deep learning; large language models

