Assessing the clinical reasoning of large language models on complex rheumatology cases: A multidimensional evaluation of four artificial intelligence models
Abstract
Background: Large language models (LLMs) have demonstrated promising capabilities in medical diagnostic reasoning, yet
their performance in specialized clinical domains such as rheumatology remains incompletely characterized. While diagnostic
accuracy has been evaluated, critical dimensions including calibration, reasoning quality, and temporal stability
have not been systematically assessed across contemporary models. Objectives: This study aimed to comprehensively
evaluate and compare the diagnostic accuracy, certainty expression, reasoning quality, and hallucination rates of four state-of-the-art LLMs (ChatGPT-4, Claude 3.5, DeepSeek-V3, and Gemini 1.5 Pro) in complex rheumatologic case scenarios. Design: A
cross-sectional, analytical, and comparative study was conducted following STARD and TRIPOD guidelines, adapted for LLM
evaluation. Nine complex rheumatologic cases from published case reports were evaluated at three time points (Days 1, 5,
and 10) between July 1 and September 18, 2025. Methods: Standardized clinical vignettes were submitted to each LLM under
controlled experimental conditions. Two blinded senior rheumatologists independently assessed diagnostic accuracy,
reasoning quality across five analytical dimensions using Likert scales, and hallucination frequency. Certainty expression
and temporal stability were quantified using intraclass correlation coefficients. Correlation analyses examined relationships
between reasoning quality and confidence expression. Results: All models achieved near-perfect diagnostic accuracy, with
ChatGPT, Claude, and Gemini correctly identifying the primary diagnosis in 100% of cases and DeepSeek in 88.9%. However,
Spearman correlation analysis revealed uniformly weak and non-significant associations between reasoning quality and
expressed certainty across all models (ρ range: -0.156 to 0.215, all p>0.05), indicating fundamental miscalibration. ChatGPT
demonstrated the highest reasoning score (3.89±0.23) and lowest hallucination rate (7.4%), while Gemini showed the
highest hallucination frequency (18.5%). Temporal stability was excellent for ChatGPT (ICC=0.84) and good for DeepSeek
(ICC=0.79). Conclusion: Despite exceptional diagnostic accuracy, current LLMs exhibit critical limitations in confidence
calibration and variable hallucination rates, representing significant barriers to safe clinical deployment in rheumatology.
Keywords
large language models, artificial intelligence, diagnostic reasoning, rheumatology, clinical decision support, calibration
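
The calibration analysis described above pairs per-case reasoning-quality ratings with each model's expressed certainty and tests their association using Spearman's ρ. The following minimal sketch (not the authors' code) illustrates that computation for one model across nine vignettes, using scipy.stats.spearmanr; the reasoning scores and certainty percentages are hypothetical values, not data from the study.

```python
# Minimal illustrative sketch of the calibration check: Spearman correlation
# between reasoning-quality ratings and expressed certainty for one model.
# All values below are hypothetical placeholders, not study data.
from scipy.stats import spearmanr

# Hypothetical blinded-rater reasoning-quality scores (1-5 Likert), one per vignette
reasoning_quality = [4, 4, 3, 5, 4, 4, 3, 4, 4]

# Hypothetical model-expressed certainty (%) for the same nine vignettes
expressed_certainty = [95, 90, 92, 85, 97, 88, 93, 90, 94]

rho, p_value = spearmanr(reasoning_quality, expressed_certainty)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")

# A weak, non-significant rho (the study reports -0.156 to 0.215, all p > 0.05)
# would indicate that stated confidence does not track reasoning quality,
# i.e., miscalibration. Temporal stability across the three time points could be
# assessed analogously with an intraclass correlation (e.g., pingouin.intraclass_corr).
```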