License: arXiv.org perpetual non-exclusive license
arXiv:2606.28329v1 [cs.IR] 19 May 2026

M3 QuestionIng: Multi-modal Multi-span Medical Question Answering

Anisha Saha (Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrucken, Germany; ansaha@mpi-inf.mpg.de), Vaibhav Rathore (Indian Institute of Technology, Bombay, Mumbai, India; vaibhav.rathore@iitb.ac.in), Abhisek Tiwari (Clinical AI Assistance, Gurugram, India), Akash Ghosh (Indian Institute of Technology, Patna, Patna, India), Sai Ruthvik Edara (Yale University, United States), and Sriparna Saha (Indian Institute of Technology, Patna, Patna, India)
Abstract.

The growing adoption of AI in healthcare, particularly in preventive care, highlights the critical need for accessibility and precision in Medical Question Answering (MedQA). In recent years, significant efforts have been made to develop multi-span medical question-answering systems, where the answer to a query may span multiple sections or paragraphs of a source document. However, existing systems fall short of aligning with real-world scenarios, where source documents often include both textual and visual content, requiring answers to incorporate images for better comprehension. To address this gap, we propose $M^{3}QAFrame$, a multi-modal, multi-span medical question-answering framework that leverages visual cues to enhance the generation of comprehensive answers drawn from diverse textual and visual spans. The model takes the context, query, and images as input and outputs an answer containing both textual answers and relevant images. The text and image embeddings are processed using a transformer-based architecture to determine the sentence and image relevance. We curate a multi-modal, multi-span medical question-answering ($M^{3}QuestionIng$) dataset containing queries, medical contexts, associated medical images, and extractive answers. Additionally, each query-answer pair is labeled with user intent and query type to enhance query and context comprehension. Extensive experiments show that our approach consistently outperforms existing methods across various evaluation metrics.

Multimodal Learning, Multi-span Question Answering, Medical Question Answering.
CCS Concepts: Computing methodologies → Neural networks; Computing methodologies → Information extraction.

1. Introduction

Medical Question Answering (MedQA) plays a crucial role in developing intelligent healthcare assistants by identifying relevant information from diverse contexts. With the lack of a sufficient number of healthcare professionals catering to the needs of a growing population (Scheffler and Arnold, 2019; Lorkowski and Jugowicz, 2021), the utilization of Artificial Intelligence (AI) based tools has shown promising directions towards reducing medical workloads efficiently (Jain et al., 2022; Tiwari et al., 2022, 2023b; AlSaad et al., 2024; Tu et al., 2025). Multimodal Medical Question Answering (MMedQA) has gained huge importance over the years due to the availability of medical data in various formats ranging across medical text records, images, video, and audio. The demand for accurate clinical decisions from sensitive medical data aggravates the complexity of this task, and determining a medical ailment from only textual records can be challenging.

Figure 1. M³QA processes medical queries by extracting relevant sentences and images from clinical documents and an associated pool of images, thereby incorporating visual information to enhance contextual understanding.
Table 1. Key distinctions between VQA, MedVQA, MedQA, MsQA, and Multi-span Multimodal QA ($M^{3}QA$). The table highlights different input modalities and answering styles. VQA focuses on general visual QA, MedQA addresses textual QA in medical contexts, MedVQA extends VQA to the medical domain, MsQA supports multi-span answering, and $M^{3}QA$ integrates multimodal fusion with multi-span reasoning.
Task/Feature | Image Input | Text Input | Medical Domain | Multi-span Answering | Multimodal Fusion
VQA
MedVQA
MedQA
MsQA
$M^{3}QA$ (Ours)

Multi-span Question Answering (MsQA) refers to answering a query from multiple non-contiguous spans extracted from a given context document. This is unlike single-span QA, where the answer lies in a continuous segment. In practice, answering a medical query is a complex process that involves integrating information from multiple sources and modalities. Medical knowledge is communicated through diverse formats including textual descriptions, illustrative diagrams, infographic summaries, and annotated figures, particularly in consumer health education materials such as patient documents, medical encyclopaedias and public health resources. Synthesizing information across these heterogeneous modalities is essential for arriving at a comprehensive answer. In this work, we aim to solve this task of Multi-modal, Multi-span, Medical Question Answering ($M^{3}QA$). Given a multi-span context $C=(s_{1},s_{2},...,s_{n})$ comprising $n$ sentences and a set of $m$ images $I=(i_{1},i_{2},...,i_{m})$ relevant to the context, $M^{3}QA$ entails identifying the subsets of $C$ and $I$ that answer a user query $Q$. In principle, $M^{3}QA$ differs from Visual Question Answering (VQA) in several aspects, as highlighted in Table 1.

In consumer health materials, visual data often plays a critical role in conveying medical information. For instance, illustrative images such as anatomical diagrams, process infographics, and annotated medical figures frequently provide complementary information that is crucial for fully and accurately answering a medical query. Existing approaches heavily rely on textual information or knowledge graphs to guide relevance prediction (Ben Abacha and Demner-Fushman, 2019; Shen et al., 2020). With the increasing complexity of medical queries, systems need to effectively understand the question’s intent and identify relevant information from diverse modalities, often spanning multiple sentences, contexts, and modalities. QueSemKnow (Tiwari et al., 2023a) highlighted the importance of leveraging query semantics and external knowledge graphs to enhance multi-span question-answering performance. However, these methods have limitations, particularly in their reliance on static external knowledge, which may fail to address dynamically evolving medical contexts or visual information embedded in medical scenarios. By incorporating image data alongside textual information, we hypothesize that the relevance prediction for medical queries can be significantly improved, leading to more accurate and comprehensive answers. Figure 1 shows the relevance of images in answering the given medical query.

To address these issues, we propose a multi-task learning approach, $M^{3}QAFrame$, where a single model is trained to answer questions and identify images relevant to the answer. We hypothesize that the complementary information embedded in images provides a richer representation of the medical context, thereby enhancing the system’s ability to identify relevant sentences. Unlike knowledge graphs, which require extensive curation and may suffer from coverage gaps, visual data inherently captures intricate patterns and features specific to medical contexts. To the best of our knowledge, our work is the first to introduce a dataset ($M^{3}QuestionIng$) and a multi-task framework ($M^{3}QAFrame$) to solve multi-modal multi-span medical question answering.

Research Questions: In this paper, we investigate the following research questions related to multimodal multi-span medical question answering: (i) Does the inclusion of images in the input space help in better identification of the context sentences that form part of the answer? (ii) Do images contribute unique and complementary information beyond the textual context in medical question answering tasks? (iii) Do existing vision language models (VLMs) perform better than models specifically trained for multimodal multi-span medical question answering?

Contributions: The key contributions of this work are as follows:

  • $M^{3}QuestionIng$ Corpus: We curate a large-scale multi-span question answering corpus annotated with semantic information, $M^{3}QuestionIng$, which contains medical contexts, queries, relevant images, intent, and question type for each context-question pair.

  • $M^{3}QAFrame$ Framework: We propose a multimodal multi-task framework, $M^{3}QAFrame$, that integrates image information with text to enhance sentence relevance prediction in extractive medical question answering, along with identifying images that are relevant to answering those queries.

  • Improved Results: Through extensive experiments, we demonstrate the effectiveness of our approach, achieving significant improvements in the evaluation metrics over existing methods (approx. 27%) and over state-of-the-art (SOTA) VLMs (approx. 13%).

2. Related Works

MedQA has been a cornerstone of AI-driven healthcare systems, with significant research focused on improving the understanding and retrieval of relevant information from diverse sources. This section reviews related work in three key areas: Medical question answering, Multi-span question-answering and Multi-modal Multi-span question-answering, which are relevant to the present work.

2.1. Medical Question Answering

Traditional Medical Question Answering approaches primarily rely on textual data to extract answers. Researchers have employed various techniques, including rule-based question classification (Dodiya and Jain, 2016), knowledge abstraction matching (Chen et al., 2019), and probabilistic inference on knowledge graphs (Goodwin and Harabagiu, 2016). More recently, unified encoder-decoder architectures like UniQA (Bae et al., 2021) have been proposed to convert natural language questions into queries. Recent advancements, such as QueSemKnow (Tiwari et al., 2023a), introduced a two-phased framework that incorporates query semantics and knowledge graphs to guide multi-span question answering. However, such reliance on static knowledge graphs often limits adaptability to dynamically evolving medical contexts, highlighting the need for alternative sources of auxiliary information. Transformer-based large language models (LLMs), such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), have been widely employed for natural language understanding tasks and have been adapted to the medical domain in models like BioBERT (Lee et al., 2020) and ClinicalBERT (Huang et al., 2019). Domain-specific LLMs like GatorTron (Yang et al., 2022) have been trained to retrieve patient information from unstructured electronic health records, while ClinicalT5 (Lu et al., 2022) has achieved state-of-the-art performance on natural-language-oriented tasks on medical documents, including document classification, named-entity recognition and natural language inference. However, they exhibit fundamental architectural and training limitations that render them unsuitable for multi-modal, multi-span question answering. GatorTron, while trained on large volumes of unstructured EHRs, is designed primarily for unimodal text retrieval. It operates over clinical notes and lacks any mechanism to jointly reason over heterogeneous inputs. Moreover, the architecture does not support cross-modal attention or grounding, so it cannot associate a textual question with visual evidence (e.g., answering a question about a tumor from a CT scan image). Similarly, ClinicalT5 achieves strong performance on single-span, text-only tasks that require identifying one contiguous answer within a single modality. In contrast, multi-span question answering requires the model to simultaneously identify and aggregate multiple non-contiguous text spans and synthesize them into a coherent answer. These models achieve remarkable accuracy by leveraging pre-trained representations but are limited to processing textual data, thereby excluding critical visual cues present in medical scenarios.

To enable better clinical reasoning and diagnosis of diseases, multi-modal datasets containing radiology images (Lau et al., 2018), electronic health records (Bae et al., 2023) and semantically-labelled images (Liu et al., 2021) have been curated. Multilingual datasets like (Matos et al., 2025) aim to incorporate the knowledge of diverse healthcare environments by introducing 568 QA-image pairs sourced from four countries. Approaches including mixture-of-experts (Jiang et al., 2024a) and retrieval-augmented generation (Xia et al., ) have been shown to mitigate hallucination in medical QA tasks. LlaVA-Med (Li et al., 2024) is one of the first multi-modal LLMs instruction-tuned on medical data, and it achieved strong performance on biomedical VQA benchmarks like VQA-RAD (Lau et al., 2018) and PathVQA (He et al., 2020). Following the trend of leveraging pre-trained knowledge of LLMs for medical QA, strong medical VLMs like MedGemma (Sellergren et al., 2025), HealthGPT (Lin et al., ) and LlaVA-Rad (Zambrano Chaves et al., 2025) have been adopted for multimodal clinical reasoning tasks. However, these models struggle to handle cross-modal references across long contexts and multiple images in a single conversation. Since retrieval of essential details from a broader context including diverse images is crucial for diagnostic purposes, we cater to this need by developing a multi-modal multi-span medical question answering dataset and proposing a method that extracts the text segments and images relevant for answering a given medical query.

2.2. Multi-span Question Answering

Zhu et al. (2020) introduced the concept of multi-span question answering by contributing a dataset. Multi-span question answering has the potential to significantly improve medical question answering by allowing the extraction of complex and precise information from texts. Traditional QA systems often struggle with questions that require answers from multiple, non-consecutive parts of a document. Most methods emphasize understanding query semantics and retrieving relevant sentences or spans from a given context. To address this, datasets like MASH-QA (Zhu et al., 2020) and MultiSpanQA (Li et al., 2022) have been developed to support multi-span question answering. These datasets, along with novel neural architectures like MultiCo (Zhu et al., 2020), can capture the relevance among multiple answer spans and form accurate answers to complex questions.

2.3. Multi-modal Multi-span Question Answering

Recent studies have highlighted the importance of multimodal question answering, where models can reason across multiple modalities, including text, tables, and images. For example, Talmor et al. (2021) introduced the MultiModalQA dataset, which requires joint reasoning over text, tables, and images to answer complex questions. Sun et al. (2023) proposed a novel multimodal question answering framework that unifies multiple information extraction tasks into a unified span extraction and multi-choice QA pipeline, demonstrating significant improvements over the state-of-the-art baselines. Zhi Lim et al. (2024) proposed a three-staged framework which encompasses unified knowledge representation, context retrieval, and answer generation followed by contextual diversity training to improve model robustness by including distractor documents as negative contexts during training.

Although significant progress has been made in text-based MedQA, single-span QA and multimodal learning, combining text and image modalities for MsQA remains underexplored. Existing approaches fail to leverage the complementary nature of visual data in answering complex medical queries. This work addresses these gaps by introducing a multimodal multi-span medical question-answering dataset, $M^{3}QuestionIng$, and a multimodal framework, $M^{3}QAFrame$. Our framework accepts large textual contexts and multi-image inputs (unlike most models, which lack at least one of these abilities) and answers a query that can span multiple segments of the context, accompanied by visual insights through relevant images, thus improving overall user satisfaction.

Figure 2. Infographic-style images enhance interpretability, reflecting real-world patterns in consumer health communication.

3. Dataset

We conducted a comprehensive review of existing medical question-answering datasets, and our findings are summarized in Table 2. Datasets such as SLAKE (Liu et al., 2021), VQA-RAD (Lau et al., 2018), and PMC-VQA (Zhang et al., 2023) are designed for textual, visual, or multimodal question answering, where each query is typically answered with a single, continuous text passage or relevant images. However, real-world medical queries often require information dispersed across multiple text segments and images. Despite our investigation, we did not identify any dataset that supports multi-span, multimodal question answering. To address this gap, we introduce $M^{3}QuestionIng$, a novel dataset specifically designed for multi-modal, multi-span medical question answering. $M^{3}QuestionIng$ includes structured medical queries, query intents, query types, textual contexts, and associated medical images, providing a more realistic and comprehensive resource for advancing medical question-answering systems. The images in $M^{3}QuestionIng$ illustrate how infographic-style visuals enhance interpretability and engagement, reflecting real-world patterns in consumer health communication where text and visuals jointly support comprehension. The role of these images is not diagnostic but didactic: they help users visually understand key ideas described in the answers. This aligns with the consumer health orientation of the QueSeMSpan dataset (Zhu et al., 2020), where queries are typically posed in layperson language (e.g. “How does sunscreen protect the skin?”). To stay true to this setting, we selected open-source medical illustrations and conceptual diagrams (as shown in Figure 2) that explain physiological processes, preventive measures, and general health concepts in an accessible manner.

$M^{3}QuestionIng$: Developing a medical dataset is a resource-intensive and sensitive process. Therefore, we opted to build upon an existing benchmark dataset by incorporating the missing features necessary to address the identified gap. We identified the MASH-QA dataset, which provides multi-span question-answer pairs but is restricted to textual information. To overcome this limitation, we extend QueSeMSpan (Tiwari et al., 2023a) by introducing both textual and visual question-answering components, thereby creating a more comprehensive dataset for medical question answering.

Table 2. Comparison of existing multi-span medical question-answering datasets.
Dataset | #QA | Context | Intent | QA Type | Images
HealthQA (Zhu et al., 2019) | 8K | No | No | Ranking | No
MedQuaD (Ben Abacha and Demner-Fushman, 2019) | 47K | No | No | Ranking | No
Medication (Abacha et al., 2019) | 690 | No | Yes | Ranking | No
MASH-QA (Zhu et al., 2020) | 35K | Yes | Yes | Extractive | No
QueSeMSpan (Tiwari et al., 2023a) | 34.8K | Yes | Yes | Extractive | No
M3QuestionIng (Ours) | 3K | Yes | Yes | Extractive | Yes

QueSeMSpan comprises healthcare queries sourced from the WebMD platform (https://www.webmd.com/), encompassing a diverse range of consumer health topics. These queries are answered by medical practitioners with relevant domain expertise, ensuring reliability and accuracy. Each data instance includes a query, query type, intent, and a context-based answer. To enhance the dataset with visual information, we randomly selected 100 data samples from QueSeMSpan and assigned a medical professional to add five relevant images per context. The process involved the doctor first reviewing the context (a set of paragraphs) to identify key concepts that could be better explained with visual aids. Next, the doctor sourced illustrative images corresponding to these key concepts. The tagged images serve as supplementary material to the textual context, enriching the multimodal nature of the dataset. We ensured that all included images are open-source and free of copyright restrictions.

To scale up the data annotation process, we employed three biology graduates to collect relevant images, tag them with the corresponding contexts, and mark the appropriate images within the answers based on the query requirements. The following guidelines were provided to the annotators to ensure consistency, accuracy, and relevance during the annotation and creation process:

Table 3. Statistics of the M3QuestionIng dataset.
Entries | Value
# of samples | 3012
# of questions annotated with image | 2392
# of intents | 11
# of query types | 12
Avg. context length (in words) | 686
Avg. answer length (in words) | 67
Avg. images per context | 4.78

Step 1: Context Comprehension. Understanding the context is crucial for accurate annotation. You should thoroughly read the context to identify critical terms, conditions, or concepts that could benefit from visual support.

  • First, carefully read the entire context to grasp the medical information presented.

  • Identify key terms for each paragraph in the context.

  • Select the key terms that may benefit from visual representation.

Step 2: Key Concept Identification. Key concept identification helps in selecting images that enhance understanding. This step involves pinpointing important medical terms and concepts within the context.

  • Identify key terms, conditions, or concepts presented in the document.

  • Select a subset of the concepts that require visual aids.

  • Highlight the terms that can be better understood with images.

Step 3: Image Collection and Tagging with Context. This step involves collecting and tagging images based on the identified concepts.

  • Images should illustrate the highlighted concepts within the context in a way that enhances understanding.

  • Each context is tagged with five images.

  • Only use images from verified, open-source databases to avoid copyright issues.

  • Ensure image selection is contextually accurate and medically relevant.

Step 4: Image Tagging with Query-Answer Pair. Accurate tagging ensures images support the query-answer pairs effectively. This step involves linking images directly with relevant query-answer segments.

  • Tag images to the respective query-answer pairs based on context relevance.

  • Ensure that images are correctly positioned to aid in answering the queries.

  • Each answer can be associated with one to a maximum of five images.

Step 5: Verification and Correction. In the final step, the medical professional reviews the annotations to ensure accuracy, consistency, and medical relevance. This step is crucial to maintaining the dataset’s quality.

  • The medical professional verifies that the images accurately represent the medical concepts described in the context.

  • The individual reviews image-query instances where annotator agreement falls below a defined threshold to ensure accuracy. The professional identifies and corrects any errors or inconsistencies, ensuring the dataset is reliable for medical applications.

  • To maintain uniformity, contexts with fewer than five images and query-answer pairs without relevant images were removed.

A sample dataset was created by medical professionals to serve as training material. We then randomly selected around 3,000 samples from QueSeMSpan and instructed the annotators to collect context-relevant images and assign them to the corresponding query-context pairs. To assess the consistency of annotations, we calculated the inter-annotator agreement using the kappa coefficient, which yielded a value of 0.81 — indicating substantial agreement among the annotators. Upon completion of the annotation process, the dataset was reviewed by a medical professional to validate the annotations and make corrections where necessary. This verification step ensured the reliability of the dataset while maintaining consistency and relevance in its visual components. The dataset statistics are reported in Table 3.
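The text does not state which kappa variant was used. With three annotators, Fleiss' kappa is one common choice, so the following is a minimal sketch under that assumption. The function name `fleiss_kappa` and the toy ratings are illustrative only and are not taken from the annotation data.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of annotators.

    ratings: list of per-item label lists, e.g. [[1, 1, 0], [0, 0, 0]]
    categories: all labels that may occur, e.g. [0, 1]
    Note: undefined (division by zero) if every rating falls in one category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][c] = number of annotators who assigned category c to item i
    counts = [Counter(item) for item in ratings]
    # Observed agreement for each item
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Share of all ratings that fall in each category
    p_cat = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)


# Hypothetical example: three annotators judge whether an image is relevant (1) or not (0).
toy_ratings = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
kappa = fleiss_kappa(toy_ratings, categories=[0, 1])
```

In such a setup, each item would be one image-query pair and each label one annotator's relevance judgment.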

Figure 3. Proposed architecture for the multi-modal, multi-span, medical question-answering ($M^{3}QAFrame$) model. The system integrates textual context, query, and associated medical images to generate precise and contextually relevant answers. The model leverages visual and textual modalities to address complex queries.

4. Methodology

Problem Formulation: Consider a collection of sentences forming a context $C=\langle s_{1},s_{2},...,s_{n}\rangle$, where $s_{j}$ represents the $j^{th}$ sentence of the context $C$ having $n$ sentences, and $m$ images $I=\langle i_{1},i_{2},...,i_{m}\rangle$ associated with the context $C$. Given a query $Q$ whose answer belongs to the context $C$, multi-modal, multi-span question answering refers to identifying the subset of relevant sentences $A_{C}=\langle s_{i},s_{j},...,s_{k}\rangle$, contiguous or non-contiguous, and the subset of relevant images $A_{I}=\langle i_{a},i_{b},...,i_{c}\rangle$ which answer the given query. For each query in the dataset, corresponding to a context, there is a unique set of sentences (the collection of all the relevant sentences) which answers the query. This excludes the possibility of multiple correct answers for a query.
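To make the formulation concrete, the following Python sketch shows one possible in-memory representation of a single instance and how the answer subsets follow from binary relevance labels. The class name `M3QAInstance`, its field names, and the toy values are illustrative assumptions and do not describe the released dataset format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class M3QAInstance:
    query: str                  # user query Q
    sentences: List[str]        # context C = <s_1, ..., s_n>
    image_ids: List[str]        # images I = <i_1, ..., i_m>
    sentence_labels: List[int]  # 1 if the sentence belongs to A_C, else 0
    image_labels: List[int]     # 1 if the image belongs to A_I, else 0

    def answer(self) -> Tuple[List[str], List[str]]:
        """Return (A_C, A_I): the relevant sentences and the relevant images."""
        a_c = [s for s, y in zip(self.sentences, self.sentence_labels) if y == 1]
        a_i = [i for i, y in zip(self.image_ids, self.image_labels) if y == 1]
        return a_c, a_i


# Hypothetical toy instance: the answer spans two non-contiguous sentences.
example = M3QAInstance(
    query="How does sunscreen protect the skin?",
    sentences=["Sunscreen absorbs UV rays.", "It comes in many brands.", "Some types reflect UV rays."],
    image_ids=["img_0", "img_1"],
    sentence_labels=[1, 0, 1],
    image_labels=[0, 1],
)
answer_sentences, answer_images = example.answer()
```

Framing both outputs as per-item binary labels matches the two relevance prediction tasks described in the rest of this section.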

The proposed model architecture (Figure 3) is inspired by the MSQA framework (Tiwari et al., 2023a). We add on a multi-tasking setting where we predict the relevant sentences and images which answer a user query. The model functions in three steps: (i) Text Processing (ii) Image Processing (iii) Fusion and Classification. While our architecture builds on the foundational transformer design (chosen specifically to handle long contexts), it introduces novel enhancements specific to the medical domain: (a) Integration of medical-domain-specific encoders. (b) Customized attention mechanisms and multi-task learning for sentence and image relevance prediction. (c) Focus on complete multi-modal integration both in the input and output.

4.1. Text Processing

The text processing module encodes the query and the context. Given a query $Q=\langle Question,Intent,Type\rangle$ and a context $C=\langle s_{1},s_{2},...,s_{n}\rangle$, where $s_{j}$ represents the $j^{th}$ sentence of the context $C$ having $n$ sentences, we concatenate $Q$ and $C$ with a special token ([SEP]) to obtain a single text input. Next, the concatenated text is passed through BiomedBERT (Gu et al., 2021) to obtain the text encodings. Since the text contains dense medical terminology and references, BiomedBERT is chosen to leverage its biomedical domain-specific pretraining.

(1) $T_{i}=\mathrm{BiomedBERT}(Q_{i}~[SEP]~C_{i})$

where $T_{i}$ is the text encoding of the $i^{th}$ of the $N$ data instances. Since the downstream objective is to predict relevant context sentences, each $s_{ij}$ (the $j^{th}$ sentence of the $i^{th}$ context $C_{i}$) is separated using a special token ([CSEP]).
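As an illustration of the input layout described above, the sketch below assembles the text that would be passed to the encoder. How the three query fields are serialized, and whether [CSEP] sits between sentences or after each one, are not specified here, so the space-joined query and the between-sentence separator are assumptions; the helper name `build_text_input` is hypothetical.

```python
def build_text_input(question, intent, qtype, sentences, sep="[SEP]", csep="[CSEP]"):
    """Concatenate the query fields and the context sentences into one string.

    Assumption: the query Q = <Question, Intent, Type> is joined with spaces,
    and [CSEP] is inserted between consecutive context sentences.
    """
    query = " ".join([question, intent, qtype])
    context = f" {csep} ".join(sentences)
    return f"{query} {sep} {context}"


# Hypothetical example values.
text_input = build_text_input(
    question="How does sunscreen protect the skin?",
    intent="information",
    qtype="how",
    sentences=["Sunscreen absorbs UV rays.", "Some types reflect UV rays."],
)
```

In a real pipeline the special tokens would also need to be registered with the tokenizer so that they are not split into sub-word pieces.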

4.2. Image Processing

The image encodings are obtained using a Vision Transformer (ViT) (Dosovitskiy et al., 2021) trained with the DINO method (Caron et al., 2021). Unlike most models, $M^{3}QAFrame$ accepts multiple images as input. To assess the relevance of images in multi-span question answering (RQ1), for all the images $I_{i}=\langle i_{i1},i_{i2},...,i_{iM}\rangle$ associated with a given context $C_{i}$, we obtain the average of all the image embeddings:

(2) $i_{im}^{\prime}=\mathrm{ViT}(i_{im})$
(3) $I_{i}^{rel}=\mathrm{mean}(i_{i1}^{\prime},i_{i2}^{\prime},...,i_{iM}^{\prime})$

However, for the image classification task, we obtain embeddings for individual images, concatenate them using a special token ([ISEP]), followed by a classification token ([ICLS]).

(4) $I_{i}^{cls}=i_{i1}^{\prime}~[ISEP]~i_{i2}^{\prime}~[ISEP]~...~i_{iM}^{\prime}~[ICLS]$

where $M$ denotes the number of images present for the context $C_{i}$; $M$ varies across samples.
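The two image representations in Eqs. (3) and (4) can be sketched in plain Python as follows. The image embeddings are taken as given (in the paper they come from the ViT encoder), and representing [ISEP] and [ICLS] as fixed vectors `isep_vec` and `icls_vec` is an assumption made for illustration; in a trained model they would presumably be learnable embeddings.

```python
def mean_pool(embeddings):
    """Eq. (3): element-wise mean over the M image embeddings of one context."""
    m = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / m for d in range(dim)]


def build_image_sequence(embeddings, isep_vec, icls_vec):
    """Eq. (4): image embeddings separated by [ISEP] and followed by [ICLS]."""
    seq = []
    for k, emb in enumerate(embeddings):
        seq.append(list(emb))
        if k < len(embeddings) - 1:
            seq.append(list(isep_vec))  # separator between consecutive images
    seq.append(list(icls_vec))          # classification token closes the sequence
    return seq


# Hypothetical 2-dimensional embeddings for M = 2 images.
image_embeddings = [[1.0, 2.0], [3.0, 4.0]]
pooled = mean_pool(image_embeddings)
sequence = build_image_sequence(image_embeddings, isep_vec=[0.0, 0.0], icls_vec=[9.0, 9.0])
```

Because M varies across samples, the sequence length 2M also varies, so batching would require padding or masking in practice.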

4.3. Fusion and Classification

To obtain the answer to a query, we need to focus on the following key components: (a) Query understanding (b) Textual context understanding (c) Image context understanding (d) Global context provided by text+image. Since the answer should also consist of the relevant images, the problem we solve here is two-headed. The following stages help us capture intricate details from both modalities.

Self Attention Layer: Since the contexts are of varied lengths, we apply a self-attention (Vaswani et al., 2017) to obtain vectors of a fixed length and understand inter-query-context and inter-context semantics, which is calculated as follows:

(5) $h_{ij}=w_{s}\tanh(W_{s}\cdot C_{ij}^{\prime})$
(6) $att_{ij}=\mathrm{Softmax}_{i}(h_{ij})$
(7) $S_{i}^{\prime}=\sum_{j=1}^{k}att_{ij}\cdot C_{ij}^{\prime}$

Here, $w_{s}$ and $W_{s}$ are learnable parameters, $C_{ij}^{\prime}$ denotes the encoded representation of the $j^{th}$ word of the $i^{th}$ sentence of the context, and $S_{i}^{\prime}$ indicates the attended hidden representation of the $i^{th}$ sentence of the context $C_{i}$.
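The pooling in Eqs. (5) to (7) can be sketched for a single sentence as below. The sketch assumes that the softmax normalizes over the words of that sentence, so that the weights in Eq. (7) sum to one; the function name and toy parameters are illustrative only.

```python
import math


def attention_pool(tokens, W_s, w_s):
    """Pool the word vectors of one sentence into a single fixed-length vector.

    tokens: list of word vectors C'_ij (each of length d)
    W_s:    projection matrix as a list of rows (each of length d)
    w_s:    scoring vector with one entry per row of W_s
    """
    scores = []
    for c in tokens:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, c))) for row in W_s]
        scores.append(sum(a * h for a, h in zip(w_s, hidden)))  # Eq. (5)
    # Numerically stable softmax over the words of the sentence, Eq. (6)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    att = [e / total for e in exps]
    # Attention-weighted sum of the word vectors, Eq. (7)
    dim = len(tokens[0])
    pooled = [sum(a * c[d] for a, c in zip(att, tokens)) for d in range(dim)]
    return pooled, att


# With an all-zero projection every word gets the same score, so pooling reduces to a mean.
pooled_vec, weights = attention_pool(
    tokens=[[1.0, 0.0], [0.0, 1.0]], W_s=[[0.0, 0.0]], w_s=[1.0]
)
```

The output length depends only on the word-vector dimension, which is what lets sentences of different lengths share the later layers.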

Inter-sentence self-attention: Since the number of sentences that are part of the final answer is very small relative to the total number of sentences in the context, the traditional method of attention-weight calculation using softmax is highly likely to suffer from skewness. To mitigate this, we calculate sparsified inter-sentence self-attention using $\alpha$-entmax (Peters et al., 2019) as follows:

(8) $si\_sa_{ij}=w_{s}\cdot\tanh(W_{s}\cdot S_{ij}^{\prime})$
(9) $\beta_{ij}=f_{s}(si\_sa_{ij})$
(10) $S_{i}^{\prime\prime}=\sum_{j=1}^{k}\beta_{ij}S_{ij}^{\prime}$
(11) $f_{s}(a)=\mathrm{ReLU}[(\alpha-1)\cdot a-\tau]^{1/(\alpha-1)}$

Here $f_{s}$ is a sparse attention function which, unlike the traditional softmax, can assign exactly zero weight to irrelevant sentences and thereby concentrates attention on the relevant ones; $\tau$ is the threshold chosen so that the resulting weights sum to one. $\beta_{ij}$ denotes the attention weight of the $i^{th}$ sentence with respect to the subject $j^{th}$ sentence and $S_{ij}^{\prime}$ is the $j^{th}$ word of $S_{i}^{\prime}$. The [ICLS] token, which captures the image information, is concatenated with each of the attended sentences to utilize the image information in sentence relevance prediction.
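To illustrate how the sparse mapping behaves, the sketch below computes $\alpha$-entmax weights by searching for the threshold $\tau$ with bisection. Bisection is used here only for simplicity and is not claimed to be the solver used in the paper; the default $\alpha=1.5$ is likewise an assumption, since the value of $\alpha$ is not stated in this section.

```python
def entmax_bisect(scores, alpha=1.5, n_iter=50):
    """alpha-entmax weights: p_j = ReLU((alpha - 1) * a_j - tau) ** (1 / (alpha - 1)).

    tau is found by bisection so that the weights sum to one. Requires alpha > 1.
    """
    assert alpha > 1.0
    z = [(alpha - 1.0) * s for s in scores]
    power = 1.0 / (alpha - 1.0)

    def total(tau):
        return sum(max(v - tau, 0.0) ** power for v in z)

    # total(tau) decreases in tau: it is >= 1 at max(z) - 1 and equals 0 at max(z).
    lo, hi = max(z) - 1.0, max(z)
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        if total(mid) >= 1.0:
            lo = mid
        else:
            hi = mid
    tau = (lo + hi) / 2.0
    probs = [max(v - tau, 0.0) ** power for v in z]
    norm = sum(probs)  # renormalize to remove the residual bisection error
    return [p / norm for p in probs]


# A clearly dominant score receives all the weight; the other entry becomes exactly zero.
sparse_weights = entmax_bisect([3.0, 0.0], alpha=1.5)
```

A softmax over the same scores would still give the second entry a small positive weight, which is the skew the sparse formulation is meant to avoid when only a few sentences are relevant.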

Sentence and Image Relevance Identification: The attended sentence vectors are now passed through a feed forward neural network. The final layer consists of a softmax function which assigns labels to whether a sentence is relevant or not.

(12) $y_{i}=\mathrm{softmax}(W_{o}\cdot S_{i}^{\prime\prime}+b_{o})$

Similarly, the image vectors concatenated with the [CLS] token encoding the text information are passed through a feed-forward network followed by a softmax layer. Here, $W_{o}$ and $b_{o}$ are the weight and bias terms, respectively. The model employs binary cross-entropy loss to backpropagate the discrepancy between the actual data and the model’s predictions. It is calculated as follows:

(13) $loss=-\sum_{j=1}^{N}\sum_{i=1}^{n}[y_{i}^{(j)}\log(\hat{y}_{i}^{(j)})+(1-y_{i}^{(j)})\log(1-\hat{y}_{i}^{(j)})]$

A similar loss is calculated for image classification. Here, $y_{i}^{(j)}$ and $\hat{y}_{i}^{(j)}$ are the true label and the predicted probability, respectively, of the $i^{th}$ sentence (image) of the $j^{th}$ sample being part of the answer (relevant to the answer), $N$ is the total number of samples and $n$ is the number of sentences (images) in the respective sample.

To ensure effective learning, the total loss is formulated as a weighted combination of the sentence relevance loss ($\mathcal{L}_{sent}$) and the image relevance loss ($\mathcal{L}_{img}$), where $\lambda_{s}$ and $\lambda_{i}$ are the respective weighting coefficients, satisfying $\lambda_{s}+\lambda_{i}=1$. The total loss function is given by:

(14) total=λssent+λiimg\mathcal{L}_{total}=\lambda_{s}\cdot\mathcal{L}_{sent}+\lambda_{i}\cdot\mathcal{L}_{img}
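The relationship between Eqs. (12)-(14) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the two-logit layout (index 1 taken as the "relevant" class), the clipping constant, and the toy values are all hypothetical; only the weights 0.7 and 0.3 come from the experimental setup.

```python
import math


def softmax(logits):
    """Softmax over class logits, as in Eq. (12)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Eq. (13) for one sample: labels are 0/1, probs are P(relevant).

    ASSUMPTION: probabilities are clipped by eps to avoid log(0).
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def total_loss(sent_loss, img_loss, lambda_s=0.7, lambda_i=0.3):
    """Eq. (14): weighted combination with lambda_s + lambda_i = 1."""
    if abs(lambda_s + lambda_i - 1.0) > 1e-9:
        raise ValueError("lambda_s and lambda_i must sum to 1")
    return lambda_s * sent_loss + lambda_i * img_loss


# Toy example: three sentences and two images with hypothetical logits.
sentence_logits = [[0.2, 1.5], [1.0, -0.5], [0.1, 0.3]]
image_logits = [[-0.4, 0.9], [0.8, 0.1]]
sentence_probs = [softmax(pair)[1] for pair in sentence_logits]  # P(relevant)
image_probs = [softmax(pair)[1] for pair in image_logits]

loss_sent = binary_cross_entropy([1, 0, 1], sentence_probs)
loss_img = binary_cross_entropy([1, 0], image_probs)
loss_total = total_loss(loss_sent, loss_img)
```

Because the two weights sum to one, the combined loss is a convex combination of the two task losses, so raising $\lambda_{s}$ shifts the training signal toward sentence relevance at the expense of image relevance.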

5. Experimental Setup

The $M^{3}QAFrame$ model was trained for 25 epochs on an RTX 2080 Ti GPU, with each epoch taking approximately 4 hours. The train, validation, and test sets were split 80%, 10%, and 10%, respectively. We used grid search to obtain the optimal set of hyperparameters: batch size (8), learning rate $\alpha$ (0.00003), and optimizer (Adam). We observed that choosing $\lambda_{s}=0.7$ and $\lambda_{i}=0.3$ helped the model learn better. This weighting is consistent with the observation that extracting relevant sentences from a large context pool is inherently more challenging than selecting a smaller subset of relevant images. The reported results are averaged over runs with five random seeds. We provide ablation studies in Table 6.
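The setup described above can be summarized as a small configuration plus a seeded split, sketched below. The hyperparameter values are those reported in this section; everything else (the dictionary keys, the function names, the shuffling strategy, and the seed values 0 to 4) is a hypothetical illustration, since the paper does not state how the split or the seeds were chosen.

```python
import random
import statistics

# Values reported in the experimental setup; key names are illustrative.
CONFIG = {
    "epochs": 25,
    "batch_size": 8,
    "learning_rate": 3e-5,
    "optimizer": "Adam",
    "lambda_s": 0.7,
    "lambda_i": 0.3,
}


def split_indices(num_samples, seed, train_frac=0.8, val_frac=0.1):
    """Shuffle sample indices and cut them 80/10/10.

    ASSUMPTION: a simple random shuffle; the paper does not describe
    the splitting procedure.
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(num_samples * train_frac)
    n_val = round(num_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


def average_over_seeds(run_fn, seeds):
    """Run one experiment per seed and return the mean score."""
    return statistics.mean(run_fn(seed) for seed in seeds)


train_ids, val_ids, test_ids = split_indices(100, seed=0)
SEEDS = [0, 1, 2, 3, 4]  # hypothetical seed values; five seeds as in the text
```

Here `run_fn` is a placeholder for a full train-and-evaluate routine that returns a single score for a given seed.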

Baselines: We compare our model against existing models that are known to perform well on the MsQA task. These include BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), TANDA (Garg et al., 2019), MultiCo (Zhu et al., 2020) and QueSemKnow (Tiwari et al., 2023a). Note that since this is the first work on MsQA with both multi-modal inputs and outputs, these baselines are unimodal. Because no multi-modal baseline is available, we also evaluated XLNet (Yang et al., 2019) and Longformer (Beltagy et al., 2020) by substituting each for BiomedBERT while keeping the rest of the model pipeline unchanged.

Table 4. Performance comparison of various MsQA models with respect to $M^{3}QAFrame$

| Model | Accuracy | F1-Score | Exact Match |
|---|---|---|---|
| BERT | 24.82 | 25.21 | 8.89 |
| RoBERTa | 26.89 | 28.65 | 9.40 |
| TANDA | 23.48 | 24.44 | 8.95 |
| MultiCo | 46.21 | 50.81 | 17.80 |
| QueSemKnow | 58.31 | 55.81 | 21.33 |
| XLNet w/o image | 38.01 | 29.19 | 9.09 |
| XLNet w/ image | 39.28 | 30.67 | 10.11 |
| Longformer w/o image | 61.85 | 67.13 | 61.85 |
| Longformer w/ image | 65.23 | 69.17 | 65.23 |
| $M^{3}QAFrame$ w/o image | 90.51 | 86.10 | 90.51 |
| $\mathbf{M^{3}QAFrame}$ w/ image (Ours) | 94.34 (29.11 $\uparrow$) | 91.32 (22.15 $\uparrow$) | 94.34 (29.11 $\uparrow$) |
Figure 4. Evaluating the role of images in medical QA using CLIP similarity. Here, visual and textual contexts are aligned to measure semantic relevance and complementary information, informing the development of more accurate multimodal MedQA systems.
Figure 5. Analyzing textual-image relationships in medical QA using CLIP encoders. Semantic similarity scores ($Sim_{1}$ and $Sim_{2}$) reveal how images complement or reinforce textual details, guiding improved multimodal MedQA approaches.
Figure 6. Comparison of CLIP similarity evaluation (left) and textual-image relationship analysis (right) in medical QA. These figures collectively highlight the role of visual and textual contexts in developing robust multimodal MedQA systems.

6. Results and Discussions

Since sentence and image-relevance identification is a multi-label classification task, we use Accuracy, F1-score, and Exact Match to evaluate the performance of the models for multimodal multi-span question answering. For Accuracy and F1-Score, we use the standard Scikit-Learn implementations accuracy_score and f1_score (average='macro'), respectively. For Exact Match, we use the exact_match implementation from the evaluate library. The score is 1 if two sentences are exactly the same and 0 otherwise. For the reference and predicted answers, the score is averaged across exact matches for individual sentences. We report performance only on the $M^{3}QuestionIng$ dataset, as it is the first of its kind, with no prior dataset having the same structure and components. The mathematical formulation of the above scores is given below:

Let a query $Q_{i}$ have a set of reference answer sentences,

(15) $\mathcal{A}_{i}=\{a_{i}^{1},a_{i}^{2},\ldots,a_{i}^{k_{i}}\}$

and a set of predicted answer sentences,

(16) $\hat{\mathcal{A}}_{i}=\{\hat{a}_{i}^{1},\hat{a}_{i}^{2},\ldots,\hat{a}_{i}^{j_{i}}\}$

where $k_{i}$ is the number of reference answer sentences (spans) for question $Q_{i}$ and $j_{i}$ is the number of predicted sentences. For each individual sentence pair, the binary accuracy (and exact match) is defined as:

(17) $score_{i}=\begin{cases}1&\text{if }a_{i}^{l}=\hat{a}_{i}^{m}\\ 0&\text{otherwise}\end{cases}$

The per-query score is the average exact match across its $k_{i}$ answer sentences:

(18) $\text{EM}_{i}=\text{Accuracy}_{i}=\frac{1}{k_{i}}\sum_{p=1}^{k_{i}}score_{p}$

The final dataset-level Exact Match / Accuracy score over all $N$ queries is:

(19) $\text{EM}=\text{Accuracy}=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{k_{i}}\sum_{p=1}^{k_{i}}score_{p}$

Precision counts the fraction of predicted sentences that exactly match a reference sentence, and Recall counts the fraction of reference sentences that were exactly predicted. These are computed over matched sentences as:

(20) $\text{Precision}_{i}=\frac{\sum_{p=1}^{k_{i}}score_{p}}{|\hat{\mathcal{A}}_{i}|}\qquad\text{Recall}_{i}=\frac{\sum_{p=1}^{k_{i}}score_{p}}{|\mathcal{A}_{i}|}$

where $|\mathcal{A}_{i}|$ and $|\hat{\mathcal{A}}_{i}|$ are the number of reference and predicted sentences, respectively, for query $Q_{i}$.

The per-question F1 score is:

(21) $\text{F1}_{i}=\frac{2\cdot\text{Precision}_{i}\cdot\text{Recall}_{i}}{\text{Precision}_{i}+\text{Recall}_{i}}$

The dataset-level F1 score over all $N$ queries is:

(22) $\text{F1}=\frac{1}{N}\sum_{i=1}^{N}\text{F1}_{i}$
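A literal reading of Eqs. (17)-(22) can be sketched as follows. This is an illustrative sketch rather than the evaluation code used for the reported numbers, which relies on Scikit-Learn and the evaluate library. The function names and the toy sentences are hypothetical, and returning 0 for empty sets or a zero denominator is an assumption, as the equations do not cover those cases.

```python
def query_scores(reference, predicted):
    """Per-query EM/accuracy, precision, recall and F1, following Eqs. (17)-(21)."""
    predicted_set = set(predicted)
    # score_p = 1 when reference sentence p exactly equals some predicted sentence.
    matched = sum(1 for sentence in reference if sentence in predicted_set)
    # ASSUMPTION: empty sets and zero denominators yield a score of 0.
    exact_match = matched / len(reference) if reference else 0.0
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return {
        "exact_match": exact_match,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def dataset_scores(references, predictions):
    """Dataset-level averages over all queries, following Eqs. (19) and (22)."""
    per_query = [query_scores(r, p) for r, p in zip(references, predictions)]
    n = len(per_query)
    return {key: sum(q[key] for q in per_query) / n for key in ("exact_match", "f1")}


# Toy example with two queries (sentence strings are placeholders).
references = [["s1", "s2", "s3", "s4"], ["t1"]]
predictions = [["s1", "s2", "x"], ["t1"]]
first_query = query_scores(references[0], predictions[0])
overall = dataset_scores(references, predictions)
```

For the first toy query, two of four reference sentences are matched by three predictions, giving an exact match of 0.5, a precision of 2/3, a recall of 0.5, and an F1 of 4/7.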

RQ1: Does the inclusion of images in the input space help in better identifying the context sentences that form the answer to a query? The results for sentence-relevance prediction in both the unimodal and multi-modal settings are shown in Table 4. We observe that including images yields a significant improvement not only in our proposed model's performance but also for XLNet and Longformer. This indicates that the images provide additional information that the models use to obtain better predictions.

RQ2: Do images contribute unique and complementary information beyond textual context in medical question answering tasks? To investigate this, we quantify this additional information using the CLIP similarity score (Hessel et al., 2021). By analyzing these similarity scores, we can assess how much value images add when answering a query. Two types of similarity are calculated:

  • Context Images & Question: To measure whether the image is semantically relevant to the question. In Figure 5, two context images achieve cosine similarities of 0.35 and 0.39 with the question, providing moderate yet useful visual cues that enhance the model’s predictions.

  • Context Images & Context Text: To measure whether the image adds information complementary to the textual context. Figure 5 presents a scenario where one image closely aligns with the context text, showing a high similarity of approximately 0.71 and thus contributing less novel information. Another image in the same context exhibits a lower similarity of about 0.45, indicating that it provides unique complementary details beyond the textual content.

A similar trend across images in the dataset indicates that these images frequently offer complementary information, particularly when their semantic alignment with the question or context text differs. In cases demanding domain-specific visual reasoning, these additional visual cues improve model performance and lead to more accurate and reliable medical question answering. Thus we conclude that images associated with a multi-span document provide additional context that cannot be derived from text alone.
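The two similarity types above reduce to cosine similarities between embedding vectors. The sketch below shows only that final step. It assumes the image, question, and context-text embeddings have already been produced by a CLIP encoder, which is not shown; the function names and the short toy vectors are hypothetical placeholders, not real CLIP outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_u * norm_v)


def image_similarities(image_vec, question_vec, context_vec):
    """Return the image-question and image-context similarities."""
    return {
        "image_question": cosine_similarity(image_vec, question_vec),
        "image_context": cosine_similarity(image_vec, context_vec),
    }


# Toy 3-dimensional stand-ins for CLIP embeddings (illustrative only).
image_embedding = [0.2, 0.7, 0.1]
question_embedding = [0.6, 0.3, 0.4]
context_embedding = [0.25, 0.65, 0.2]
similarities = image_similarities(image_embedding, question_embedding, context_embedding)
```

Under this reading, a high image-context similarity suggests the image largely repeats what the text already says, while a lower value suggests the image carries information the text does not.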

RQ3: Do existing vision language models (VLMs) perform better than models specifically trained for multi-modal multi-span medical question answering? We performed zero-shot and few-shot evaluations on both domain-specific and general-purpose state-of-the-art (SOTA) VLMs, namely Uni-MedCLIP (Khattak et al., 2024), MMEmbed (Lin et al., 2024), VLM2Vec (Jiang et al., 2024b), LLaVA (Liu et al., 2024) and GPT-4o (OpenAI et al., 2024), and report the average score over both settings because the deviation between them is very low. In addition, we fine-tune Qwen2.5 VL (Bai et al., 2025) and MedGemma (Sellergren et al., 2025) for a fair comparison against our proposed architecture trained on $M^{3}QuestionIng$. The prompts used are provided in the Appendix.

Figure 7. Illustration of the importance of visuals in Multi-span Question Answering
Table 5. Comparison of performance between our proposed model $M^{3}QAFrame$ and SOTA VLMs in the MsQA task.

| Model | Accuracy | F1-Score | Exact Match |
|---|---|---|---|
| Uni-MedCLIP | 12.74 | 22.60 | 11.78 |
| MMEmbed | 52.15 | 66.67 | 34.59 |
| VLM2Vec | 28.05 | 43.51 | 27.84 |
| LLaVA | 40.56 | 56.32 | 40.87 |
| Qwen2.5-VL (fine-tuned) | 65.38 | 5.07 | 65.38 |
| MedGemma (fine-tuned) | 39.16 | 55.05 | 39.16 |
| GPT-4o | 81.2 | 78.5 | 79.5 |
| $\mathbf{M^{3}QAFrame}$ (Ours) | 94.34 (13.14 $\uparrow$) | 91.32 (12.82 $\uparrow$) | 94.34 (14.84 $\uparrow$) |

We exclude comparisons with MedVQA models such as PMC-CLIP (Lin et al., 2023), BioViL (Bannur et al., 2023) and LLaVA-Med (Li et al., 2024) for the following reasons: (i) The contexts in our dataset are very long (maximum words in a context = 2800) and cannot be handled by these models. Since the purpose of our work is to answer a query with sentences spanning the entire context, truncating a context to 10-20% of its actual length defeats that purpose. (ii) These models do not accept multiple images as input, unlike our model. (iii) They are trained for different downstream tasks. Table 5 illustrates the performance gain of $M^{3}QAFrame$, thereby establishing the need for fine-tuned deep networks for MsQA over SOTA VLMs.

Figure 8. Example of VLM2Vec’s performance
Table 6. Impact of various elements of $\mathbf{M^{3}QAFrame}$

| Model | Accuracy | F1-Score | Exact Match |
|---|---|---|---|
| $M^{3}QAFrame$ w/o Intent | 91.75 | 87.66 | 91.75 |
| $M^{3}QAFrame$ w/o Query | 92.57 | 88.20 | 92.57 |
| $M^{3}QAFrame$ w/o Self-attention | 84.14 | 80.37 | 84.14 |

7. Case Study and Analysis

To analyze the model’s performance in different scenarios, this section details a closer look at the generated outputs, adversarial scenarios where the model fails, and the importance of various building blocks of the model.

  • Impact of images: One of the major motivations of our work is to study how essential images are for MsQA. We observe a significant improvement in our model’s performance when images accompany text as input. An example of this can be found in Figure 7.

  • Performance of VLMs: Figure 8 illustrates the subpar performance of VLMs on the $M^{3}QA$ task. This can be attributed to the lack of similarly structured data in their pre-training datasets. For example, the notably low F1 score (5.07%) observed for the fine-tuned Qwen2.5-VL model can be attributed to a structural mismatch between how the model generates answers and how multi-span answers are evaluated. Qwen2.5-VL, being a generative VLM, tends to produce free-form responses rather than concise, discrete answer spans, which is a common tendency of instruction-tuned autoregressive models. As a result, the predicted answer set either contains many sentences that do not exactly match any reference span or is structurally misaligned with the expected multi-span format. This behavior highlights the limitation of adapting VLMs to structured multi-span QA tasks. Besides, these VLMs have not been specifically trained on the downstream task of sentence relevance prediction, which is core to $M^{3}QA$. Additionally, a majority of the open-source VLMs cannot generate images as output.

  • Impact of architecture components: The ablation study in Table 6 demonstrates that query type is more essential than query intent for answer extraction. It also highlights the importance of self-attention for the model to develop semantic understanding between the query and the large context, owing to inter-sentence reasoning over a large number of pairs.

8. Limitations and Future Work

The proposed dataset and model provide a foundation for multi-modal, multi-span medical question answering, which has not been explored much so far. Although the dataset was created using stratified sampling, which reduces systematic bias, it might still favor medical topics that are visually representable (e.g., anatomical or dermatological content). Besides, the model architecture could be improved to cater more precisely to the needs of domain-specific understanding. In the current work, we do not use a medical domain-specific vision encoder like MedCLIP (Wang et al., 2022), which could be added to the present model in the future. We also observed that although images provide extra information, the present vision encoder cannot properly identify text embedded within an image. Further, we use separate frozen text and vision encoders and only train the succeeding layers. In the future, we aim to explore sophisticated modality fusion and alignment techniques like Q-Former (Li et al., 2023). We also plan to exploit the pretrained knowledge of existing vision-language foundation models and customize them for the task of multimodal multi-span question answering. Our work currently does not involve an image generation module, which could help generate images relevant to the answers for illustrative purposes. Future extensions leveraging transformer-based diffusion models could be integrated into the current architecture to generate reference images aligned with multi-span answer contexts.

9. Conclusion

In this paper, we address the problem of multi-modal multi-span medical question answering. For a given user query, our approach retrieves a textual answer spanning multiple segments of a document, along with relevant images that provide visual insight into the answer, improving user experience. We curated a first-of-its-kind dataset, $M^{3}QuestionIng$, which contains medical contexts, queries, query-semantic information, relevant images, and extractive answers. We also propose a transformer-based architecture, $M^{3}QAFrame$, for sentence and image relevance identification that leverages the knowledge obtained from both modalities. Unlike previous works, our model accepts and yields both text and multiple images. We establish how indispensable context images are, as they provide unique and complementary information on top of context documents to improve MsQA. Our model outperforms the existing state-of-the-art architectures for multi-span question answering as well as SOTA VLMs. In our study, we collaborated with medical specialists who found that adding images enhances the overall user experience. Doctors particularly valued images during their second-level verification process, as the images gave them a better understanding of the patient’s case. In such scenarios, our model proves to be highly valuable. In the future, we intend to leverage the pre-training of open-source VLMs by fine-tuning them on our dataset and improving the modality fusion technique.

10. Ethical Considerations

Our project introduces a new dataset, M3 QuestionIng. To achieve this, we collaborated with a medical practitioner, who is also a co-author of this paper, to ensure accuracy and quality control from data collection through validation. To uphold ethical standards, we compensated all volunteers in line with the government’s minimum wage regulations. We paid utmost attention to privacy concerns and ensured the dataset was free of any images that could compromise individuals’ privacy. We are deeply committed to ethical principles and the responsible use of AI for social good.

References

  • A. B. Abacha, Y. Mrabet, M. Sharp, T. R. Goodwin, S. E. Shooshan, and D. Demner-Fushman (2019) Bridging the gap between consumers’ medication questions and trusted answers. In MEDINFO 2019: Health and Wellbeing e-Networks for All, pp. 25–29. Cited by: Table 2.
  • R. AlSaad, A. Abd-Alrazaq, S. Boughorbel, A. Ahmed, M. Renault, R. Damseh, and J. Sheikh (2024) Multimodal large language models in health care: applications, challenges, and future outlook. Journal of medical Internet research 26, pp. e59505. Cited by: §1.
  • S. Bae, D. Kim, J. Kim, and E. Choi (2021) Question answering for complex electronic health records database using unified encoder-decoder architecture. In Machine learning for health, pp. 13–25. Cited by: §2.1.
  • S. Bae, D. Kyung, J. Ryu, E. Cho, G. Lee, S. Kweon, J. Oh, L. Ji, E. Chang, T. Kim, et al. (2023) Ehrxqa: a multi-modal question answering dataset for electronic health records with chest x-ray images. Advances in Neural Information Processing Systems 36, pp. 3867–3880. Cited by: §2.1.
  • S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin (2025) Qwen2.5-vl technical report. External Links: 2502.13923, Link Cited by: §6.
  • S. Bannur, S. Hyland, Q. Liu, F. Perez-Garcia, M. Ilse, D. C. Castro, B. Boecking, H. Sharma, K. Bouzid, A. Thieme, et al. (2023) Learning to exploit temporal structure for biomedical vision-language processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15016–15027. Cited by: §6.
  • I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150. Cited by: §5.
  • A. Ben Abacha and D. Demner-Fushman (2019) A question-entailment approach to question answering. BMC bioinformatics 20, pp. 1–23. Cited by: §1, Table 2.
  • M. Caron, H. Touvron, I. Misra, H. Jegou, J. Mairal, P. Bojanowski, and A. Joulin (2021) Emerging properties in self-supervised vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Vol. , pp. 9630–9640. External Links: Document Cited by: §4.2.
  • J. Chen, J. Zhou, Z. Shi, B. Fan, and C. Luo (2019) Knowledge abstraction matching for medical question answering. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Vol. , pp. 342–347. External Links: Document Cited by: §2.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §2.1, §5.
  • T. Dodiya and S. Jain (2016) Question classification for medical domain question answering system. In 2016 IEEE International WIE Conference on Electrical and Computer Engineering (WIECON-ECE), Vol. , pp. 204–207. External Links: Document Cited by: §2.1.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. External Links: 2010.11929, Link Cited by: §4.2.
  • S. Garg, T. Vu, and A. Moschitti (2019) TANDA: transfer and adapt pre-trained transformer models for answer sentence selection. In AAAI Conference on Artificial Intelligence, External Links: Link Cited by: §5.
  • T. R. Goodwin and S. M. Harabagiu (2016) Medical question answering for clinical decision support. In Proceedings of the 25th ACM international on conference on information and knowledge management, pp. 297–306. Cited by: §2.1.
  • Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon (2021) Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare 3 (1), pp. 1–23. External Links: ISSN 2637-8051, Link, Document Cited by: §4.1.
  • X. He, Y. Zhang, L. Mou, E. Xing, and P. Xie (2020) Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286. Cited by: §2.1.
  • J. Hessel, A. Holtzman, M. Forbes, R. Le Bras, and Y. Choi (2021) CLIPScore: a reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, M. Moens, X. Huang, L. Specia, and S. W. Yih (Eds.), Online and Punta Cana, Dominican Republic, pp. 7514–7528. External Links: Link, Document Cited by: §6.
  • K. Huang, J. Altosaar, and R. Ranganath (2019) Clinicalbert: modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342. Cited by: §2.1.
  • R. Jain, A. Jangra, S. Saha, and A. Jatowt (2022) A survey on medical document summarization. arXiv preprint arXiv:2212.01669. Cited by: §1.
  • S. Jiang, T. Zheng, Y. Zhang, Y. Jin, L. Yuan, and Z. Liu (2024a) Med-moe: mixture of domain-specific experts for lightweight medical vision-language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 3843–3860. Cited by: §2.1.
  • Z. Jiang, R. Meng, X. Yang, S. Yavuz, Y. Zhou, and W. Chen (2024b) Vlm2vec: training vision-language models for massive multimodal embedding tasks. arXiv preprint arXiv:2410.05160. Cited by: §6.
  • M. U. Khattak, S. Kunhimon, M. Naseer, S. Khan, and F. S. Khan (2024) UniMed-clip: towards a unified image-text pretraining paradigm for diverse medical imaging modalities. arXiv preprint arXiv:2412.10372. Cited by: §6.
  • J. J. Lau, S. Gayen, A. Ben Abacha, and D. Demner-Fushman (2018) A dataset of clinically generated visual questions and answers about radiology images. Scientific data 5 (1), pp. 1–10. Cited by: §2.1, §3.
  • J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 (4), pp. 1234–1240. Cited by: §2.1.
  • C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao (2024) Llava-med: training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems 36. Cited by: §2.1, §6.
  • H. Li, M. Tomko, M. Vasardani, and T. Baldwin (2022) MultiSpanQA: a dataset for multi-span question answering. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1250–1260. Cited by: §2.2.
  • J. Li, D. Li, S. Savarese, and S. Hoi (2023) BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. External Links: 2301.12597, Link Cited by: §8.
  • S. Lin, C. Lee, M. Shoeybi, J. Lin, B. Catanzaro, and W. Ping (2024) Mm-embed: universal multimodal retrieval with multimodal llms. arXiv preprint arXiv:2411.02571. Cited by: §6.
  • [30] T. Lin, W. Zhang, S. LI, Y. Yuan, B. Yu, H. Li, W. He, H. Jiang, M. Li, S. Tang, et al. HealthGPT: a medical large vision-language model for unifying comprehension and generation via heterogeneous knowledge adaptation. In Forty-second International Conference on Machine Learning, Cited by: §2.1.
  • W. Lin, Z. Zhao, X. Zhang, C. Wu, Y. Zhang, Y. Wang, and W. Xie (2023) Pmc-clip: contrastive language-image pre-training using biomedical documents. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 525–536. Cited by: §6.
  • B. Liu, L. Zhan, L. Xu, L. Ma, Y. Yang, and X. Wu (2021) SLAKE: a semantically-labeled knowledge-enhanced dataset for medical visual question answering. External Links: 2102.09542, Link Cited by: §2.1, §3.
  • H. Liu, C. Li, Q. Wu, and Y. J. Lee (2024) Visual instruction tuning. Advances in neural information processing systems 36. Cited by: §6.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. External Links: 1907.11692, Link Cited by: §2.1, §5.
  • J. Lorkowski and A. Jugowicz (2021) Shortage of physicians: a critical review. Medical Research and Innovation, pp. 57–62. Cited by: §1.
  • Q. Lu, D. Dou, and T. Nguyen (2022) ClinicalT5: a generative language model for clinical text. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 5436–5443. Cited by: §2.1.
  • J. Matos, S. Chen, S. K. V. Placino, Y. Li, J. C. C. Pardo, D. Idan, T. Tohyama, D. Restrepo, L. F. Nakayama, J. M. M. Pascual-Leone, G. K. Savova, H. Aerts, L. A. Celi, A. I. Wong, D. Bitterman, and J. Gallifant (2025) WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation. In Findings of the Association for Computational Linguistics: NAACL 2025, L. Chiruzzo, A. Ritter, and L. Wang (Eds.), Albuquerque, New Mexico, pp. 7203–7216. External Links: Link, Document, ISBN 979-8-89176-195-7 Cited by: §2.1.
  • OpenAI, :, A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, A. Mądry, A. Baker-Whitcomb, A. Beutel, A. Borzunov, A. Carney, A. Chow, A. Kirillov, A. Nichol, A. Paino, A. Renzin, A. T. Passos, A. Kirillov, A. Christakis, A. Conneau, A. Kamali, A. Jabri, A. Moyer, A. Tam, A. Crookes, A. Tootoochian, A. Tootoonchian, A. Kumar, A. Vallone, A. Karpathy, A. Braunstein, A. Cann, A. Codispoti, A. Galu, A. Kondrich, A. Tulloch, A. Mishchenko, A. Baek, A. Jiang, A. Pelisse, A. Woodford, A. Gosalia, A. Dhar, A. Pantuliano, A. Nayak, A. Oliver, B. Zoph, B. Ghorbani, B. Leimberger, B. Rossen, B. Sokolowsky, B. Wang, B. Zweig, B. Hoover, B. Samic, B. McGrew, B. Spero, B. Giertler, B. Cheng, B. Lightcap, B. Walkin, B. Quinn, B. Guarraci, B. Hsu, B. Kellogg, B. Eastman, C. Lugaresi, C. Wainwright, C. Bassin, C. Hudson, C. Chu, C. Nelson, C. Li, C. J. Shern, C. Conger, C. Barette, C. Voss, C. Ding, C. Lu, C. Zhang, C. Beaumont, C. Hallacy, C. Koch, C. Gibson, C. Kim, C. Choi, C. McLeavey, C. Hesse, C. Fischer, C. Winter, C. Czarnecki, C. Jarvis, C. Wei, C. Koumouzelis, D. Sherburn, D. Kappler, D. Levin, D. Levy, D. Carr, D. Farhi, D. Mely, D. Robinson, D. Sasaki, D. Jin, D. Valladares, D. Tsipras, D. Li, D. P. Nguyen, D. Findlay, E. Oiwoh, E. Wong, E. Asdar, E. Proehl, E. Yang, E. Antonow, E. Kramer, E. Peterson, E. Sigler, E. Wallace, E. Brevdo, E. Mays, F. Khorasani, F. P. Such, F. Raso, F. Zhang, F. von Lohmann, F. Sulit, G. Goh, G. Oden, G. Salmon, G. Starace, G. Brockman, H. Salman, H. Bao, H. Hu, H. Wong, H. Wang, H. Schmidt, H. Whitney, H. Jun, H. Kirchner, H. P. de Oliveira Pinto, H. Ren, H. Chang, H. W. Chung, I. Kivlichan, I. O’Connell, I. O’Connell, I. Osband, I. Silber, I. Sohl, I. Okuyucu, I. Lan, I. Kostrikov, I. Sutskever, I. Kanitscheider, I. Gulrajani, J. Coxon, J. Menick, J. Pachocki, J. Aung, J. Betker, J. Crooks, J. Lennon, J. Kiros, J. Leike, J. Park, J. Kwon, J. Phang, J. Teplitz, J. 
Wei, J. Wolfe, J. Chen, J. Harris, J. Varavva, J. G. Lee, J. Shieh, J. Lin, J. Yu, J. Weng, J. Tang, J. Yu, J. Jang, J. Q. Candela, J. Beutler, J. Landers, J. Parish, J. Heidecke, J. Schulman, J. Lachman, J. McKay, J. Uesato, J. Ward, J. W. Kim, J. Huizinga, J. Sitkin, J. Kraaijeveld, J. Gross, J. Kaplan, J. Snyder, J. Achiam, J. Jiao, J. Lee, J. Zhuang, J. Harriman, K. Fricke, K. Hayashi, K. Singhal, K. Shi, K. Karthik, K. Wood, K. Rimbach, K. Hsu, K. Nguyen, K. Gu-Lemberg, K. Button, K. Liu, K. Howe, K. Muthukumar, K. Luther, L. Ahmad, L. Kai, L. Itow, L. Workman, L. Pathak, L. Chen, L. Jing, L. Guy, L. Fedus, L. Zhou, L. Mamitsuka, L. Weng, L. McCallum, L. Held, L. Ouyang, L. Feuvrier, L. Zhang, L. Kondraciuk, L. Kaiser, L. Hewitt, L. Metz, L. Doshi, M. Aflak, M. Simens, M. Boyd, M. Thompson, M. Dukhan, M. Chen, M. Gray, M. Hudnall, M. Zhang, M. Aljubeh, M. Litwin, M. Zeng, M. Johnson, M. Shetty, M. Gupta, M. Shah, M. Yatbaz, M. J. Yang, M. Zhong, M. Glaese, M. Chen, M. Janner, M. Lampe, M. Petrov, M. Wu, M. Wang, M. Fradin, M. Pokrass, M. Castro, M. O. T. de Castro, M. Pavlov, M. Brundage, M. Wang, M. Khan, M. Murati, M. Bavarian, M. Lin, M. Yesildal, N. Soto, N. Gimelshein, N. Cone, N. Staudacher, N. Summers, N. LaFontaine, N. Chowdhury, N. Ryder, N. Stathas, N. Turley, N. Tezak, N. Felix, N. Kudige, N. Keskar, N. Deutsch, N. Bundick, N. Puckett, O. Nachum, O. Okelola, O. Boiko, O. Murk, O. Jaffe, O. Watkins, O. Godement, O. Campbell-Moore, P. Chao, P. McMillan, P. Belov, P. Su, P. Bak, P. Bakkum, P. Deng, P. Dolan, P. Hoeschele, P. Welinder, P. Tillet, P. Pronin, P. Tillet, P. Dhariwal, Q. Yuan, R. Dias, R. Lim, R. Arora, R. Troll, R. Lin, R. G. Lopes, R. Puri, R. Miyara, R. Leike, R. Gaubert, R. Zamani, R. Wang, R. Donnelly, R. Honsby, R. Smith, R. Sahai, R. Ramchandani, R. Huet, R. Carmichael, R. Zellers, R. Chen, R. Chen, R. Nigmatullin, R. Cheu, S. Jain, S. Altman, S. Schoenholz, S. Toizer, S. Miserendino, S. Agarwal, S. Culver, S. Ethersmith, S. Gray, S. 
Grove, S. Metzger, S. Hermani, S. Jain, S. Zhao, S. Wu, S. Jomoto, S. Wu, Shuaiqi, Xia, S. Phene, S. Papay, S. Narayanan, S. Coffey, S. Lee, S. Hall, S. Balaji, T. Broda, T. Stramer, T. Xu, T. Gogineni, T. Christianson, T. Sanders, T. Patwardhan, T. Cunninghman, T. Degry, T. Dimson, T. Raoux, T. Shadwell, T. Zheng, T. Underwood, T. Markov, T. Sherbakov, T. Rubin, T. Stasi, T. Kaftan, T. Heywood, T. Peterson, T. Walters, T. Eloundou, V. Qi, V. Moeller, V. Monaco, V. Kuo, V. Fomenko, W. Chang, W. Zheng, W. Zhou, W. Manassra, W. Sheu, W. Zaremba, Y. Patil, Y. Qian, Y. Kim, Y. Cheng, Y. Zhang, Y. He, Y. Zhang, Y. Jin, Y. Dai, and Y. Malkov (2024) GPT-4o system card. External Links: 2410.21276, Link Cited by: §6.
  • B. Peters, V. Niculae, and A. F. T. Martins (2019) Sparse sequence-to-sequence models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, A. Korhonen, D. Traum, and L. Màrquez (Eds.), Florence, Italy, pp. 1504–1519. External Links: Link, Document Cited by: §4.3.
  • R. M. Scheffler and D. R. Arnold (2019) Projecting shortages and surpluses of doctors and nurses in the oecd: what looms ahead. Health Economics, Policy and Law 14 (2), pp. 274–290. Cited by: §1.
  • A. Sellergren, S. Kazemzadeh, T. Jaroensri, A. Kiraly, M. Traverse, T. Kohlberger, S. Xu, F. Jamil, C. Hughes, C. Lau, et al. (2025) Medgemma technical report. arXiv preprint arXiv:2507.05201. Cited by: §2.1, §6.
  • S. Shen, Y. Li, N. Du, X. Wu, Y. Xie, S. Ge, T. Yang, K. Wang, X. Liang, and W. Fan (2020) On the generation of medical question-answer pairs. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 8822–8829. Cited by: §1.
  • Y. Sun, K. Zhang, and Y. Su (2023) Multimodal question answering for unified information extraction. External Links: 2310.03017, Link Cited by: §2.3.
  • A. Talmor, O. Yoran, A. Catav, D. Lahav, Y. Wang, A. Asai, G. Ilharco, H. Hajishirzi, and J. Berant (2021) MultiModalQA: complex question answering over text, tables and images. External Links: 2104.06039, Link Cited by: §2.3.
  • A. Tiwari, A. Bhansali, S. Saha, P. Bhattacharyya, P. Verma, and M. Dhar (2023a) Local context is not enough! towards query semantic and knowledge guided multi-span medical question answering.. In ECAI, pp. 2354–2361. Cited by: §1, §2.1, Table 2, §3, §4, §5.
  • A. Tiwari, M. Manthena, S. Saha, P. Bhattacharyya, M. Dhar, and S. Tiwari (2022) Dr. can see: towards a multi-modal disease diagnosis virtual assistant. In Proceedings of the 31st ACM international conference on information & knowledge management, pp. 1935–1944. Cited by: §1.
  • A. Tiwari, A. Saha, S. Saha, P. Bhattacharyya, and M. Dhar (2023b) Experience and evidence are the eyes of an excellent summarizer! towards knowledge infused multi-modal clinical conversation summarization. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 2452–2461. Cited by: §1.
  • T. Tu, M. Schaekermann, A. Palepu, K. Saab, J. Freyberg, R. Tanno, A. Wang, B. Li, M. Amin, Y. Cheng, et al. (2025) Towards conversational diagnostic artificial intelligence. Nature, pp. 1–9. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, pp. 6000–6010. External Links: ISBN 9781510860964 Cited by: §4.3.
  • Z. Wang, Z. Wu, D. Agarwal, and J. Sun (2022) MedCLIP: contrastive learning from unpaired medical images and text. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3876–3887. External Links: Link Cited by: §8.
  • P. Xia, K. Zhu, H. Li, T. Wang, W. Shi, S. Wang, L. Zhang, J. Zou, and H. Yao. MMed-RAG: versatile multimodal RAG system for medical vision language models. Cited by: §2.1.
  • X. Yang, A. Chen, N. PourNejatian, H. C. Shin, K. E. Smith, C. Parisien, C. Compas, C. Martin, M. G. Flores, Y. Zhang, et al. (2022) GatorTron: a large clinical language model to unlock patient information from unstructured electronic health records. arXiv preprint arXiv:2203.03540. Cited by: §2.1.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. Cited by: §5.
  • J. M. Zambrano Chaves, S. Huang, Y. Xu, H. Xu, N. Usuyama, S. Zhang, F. Wang, Y. Xie, M. Khademi, Z. Yang, et al. (2025) A clinically accessible small multimodal radiology model and evaluation metric for chest x-ray findings. Nature Communications 16 (1), pp. 3108. Cited by: §2.1.
  • X. Zhang, C. Wu, Z. Zhao, W. Lin, Y. Zhang, Y. Wang, and W. Xie (2023) PMC-VQA: visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415. Cited by: §3.
  • Q. Zhi Lim, C. Poo Lee, K. Ming Lim, and A. Kamsani Samingan (2024) UniRaG: unification, retrieval, and generation for multimodal question answering with pre-trained language models. IEEE Access 12, pp. 71505–71519. External Links: Document Cited by: §2.3.
  • M. Zhu, A. Ahuja, D. Juan, W. Wei, and C. K. Reddy (2020) Question answering with long multiple-span answers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3840–3849. Cited by: §2.2, Table 2, §3, §5.
  • M. Zhu, A. Ahuja, W. Wei, and C. K. Reddy (2019) A hierarchical attention retrieval model for healthcare question answering. In The World Wide Web Conference, pp. 2472–2482. Cited by: Table 2.

11. Appendix

Zero-Shot Prompt

Prompt: You are given a set of context sentences and a set of images. Your task is to identify the most relevant sentences and images that answer the question.
CONTEXT: {context}
IMAGES: {image_set}
QUESTION: {question}
TASK: List the sentences most relevant to answering the question. List the images most relevant to answering the question.
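As an illustration of how the zero-shot template might be filled in practice, the following is a minimal sketch. Only the template wording comes from the prompt above. The helper names (`number_items`, `build_zero_shot_prompt`), the `[S1]`/`[I1]` labelling scheme, the newline layout, and the use of file names to stand in for images are all assumptions made for this example, not details stated in the paper.

```python
# Sketch only: helper names, ID labels, and line breaks are assumptions.
# The template wording is taken from the zero-shot prompt in the appendix.
ZERO_SHOT_TEMPLATE = (
    "You are given a set of context sentences and a set of images. "
    "Your task is to identify the most relevant sentences and images "
    "that answer the question.\n"
    "CONTEXT: {context}\n"
    "IMAGES: {image_set}\n"
    "QUESTION: {question}\n"
    "TASK: List the sentences most relevant to answering the question. "
    "List the images most relevant to answering the question."
)


def number_items(items, prefix):
    """Label each item (S1, S2, ... or I1, I2, ...) so it can be cited by ID."""
    return "\n".join(
        f"[{prefix}{i}] {item}" for i, item in enumerate(items, start=1)
    )


def build_zero_shot_prompt(sentences, image_refs, question):
    """Fill the zero-shot template with labelled sentences and image references."""
    return ZERO_SHOT_TEMPLATE.format(
        context=number_items(sentences, "S"),
        image_set=number_items(image_refs, "I"),
        question=question,
    )


prompt = build_zero_shot_prompt(
    sentences=["Sentence one.", "Sentence two."],  # placeholder context
    image_refs=["image_a.png", "image_b.png"],     # placeholder image names
    question="Placeholder question?",
)
print(prompt)
```

Labelling each sentence and image with a short ID is one way to make the model's selections easy to map back to the original spans. How images are actually passed to a multi-modal model depends on the model's interface and is outside this sketch.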

Few-Shot Prompt

Prompt: You are a medical expert whose task is to analyze the given context sentences and medical images to answer the question.
Guidance Example: {example}
Context: {context}
Images: {image_set}
Question: {question}
Relevant Sentences: {sentences}
Relevant Images: {images}
TASK: For the given context: {context}, images: {image_set}, and question: {question}, identify the most relevant sentences and images that answer the question as per the {example}.
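The few-shot template reuses several placeholders more than once, so a small check that every field is supplied before formatting can catch mistakes early. The sketch below takes the template literally. The helper names (`placeholders`, `fill`), the newline layout, and the placeholder values are assumptions for illustration. The appendix does not specify how `{example}`, `{sentences}`, and `{images}` are populated, so they are passed here as opaque strings.

```python
from string import Formatter

# Template wording follows the few-shot prompt in the appendix; the line
# breaks and all helper names below are assumptions made for this sketch.
FEW_SHOT_TEMPLATE = (
    "You are a medical expert whose task is to analyze the given context "
    "sentences and medical images to answer the question.\n"
    "Guidance Example: {example}\n"
    "Context: {context}\n"
    "Images: {image_set}\n"
    "Question: {question}\n"
    "Relevant Sentences: {sentences}\n"
    "Relevant Images: {images}\n"
    "TASK: For the given context: {context}, images: {image_set}, and "
    "question: {question}, identify the most relevant sentences and images "
    "that answer the question as per the {example}."
)


def placeholders(template):
    """Return the distinct {name} fields that appear in the template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def fill(template, **values):
    """Fill the template, raising KeyError if any field is not supplied."""
    missing = placeholders(template) - set(values)
    if missing:
        raise KeyError(f"missing template fields: {sorted(missing)}")
    return template.format(**values)


filled = fill(
    FEW_SHOT_TEMPLATE,
    example="<demonstration text>",    # placeholder, content not specified
    context="<context sentences>",     # placeholder
    image_set="<image references>",    # placeholder
    question="Placeholder question?",  # placeholder
    sentences="<relevant sentences>",  # placeholder
    images="<relevant images>",        # placeholder
)
print(filled)
```

Because `str.format` substitutes every occurrence of a named field, the repeated `{context}`, `{image_set}`, `{question}`, and `{example}` slots receive the same value throughout the prompt.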