Authors: Uzunbayir, Serhat; Terziler, Deniz Eren; Gedizlioglu, Cinar; Altuner, Beyza
Dates: 2026-03-27; 2026-03-27; 2025-10-26
ISBN: 9798331555658; 9798331555665
ISSN: 2687-7775
URI: https://hdl.handle.net/20.500.14365/8883
DOI: https://doi.org/10.1109/TIPTEKNO68206.2025.11270162

Abstract: Knee injuries, in particular abnormalities of the anterior cruciate ligament (ACL) and the meniscus, are frequently diagnosed using MRI scans. Although MRI interpretation typically requires expert knowledge, such expertise is not always accessible. Recently, researchers have begun applying Large Language Models (LLMs) in the medical domain to assist with diagnostic interpretation tasks. Here, we investigate the potential of LLM-based chatbots to assist and augment the diagnostic interpretation of knee MRI images. Specifically, we compare the diagnostic outputs of ChatGPT-4o, Gemini 2.5 Flash, and Claude Sonnet 4 to assess whether they can annotate ACL injury, meniscal tear, and abnormality of any type. Using visual MRI slices as input, we evaluated the interpretations produced by these multimodal-capable chatbots against ground-truth labels provided by professional radiologists. Our findings illustrate each chatbot model's relative strengths and weaknesses in medical imaging analysis, contributing evidence toward the development of AI-augmented workflows for medical imaging and radiology.

Language: en
Access rights: info:eu-repo/semantics/closedAccess
Keywords: AI; Chatbot; Large Language Models; Medical Imaging; Healthcare; Medical Decision-Making
Title: A Multi-Chatbot Evaluation Framework for Knee MRI Diagnosis Assistance
Type: Conference Object
DOI: 10.1109/TIPTEKNO68206.2025.11270162
Scopus ID: 2-s2.0-105030546779