Publication Details
Performance of the Large Language Models in African rheumatology: a diagnostic test accuracy study of ChatGPT-4, Gemini, Copilot, and Claude artificial intelligence
Article link:
Discipline: Clinical medicine
Author(s): Yannick Laurent Tchenadoyo Bayala1*, Wendlassida Joelle Stéphanie Zabsonré/Tiendrebeogo1, Dieu-Donné Ouedraogo1, Fulgence Kaboré1, Charles Sougué2, Aristide Relwendé Yameogo3, Wendlassida Martin Nacanabo4, Ismael Ayouba Tinni1, Aboubakar Ouedraogo1 and Yamyellé Enselme Zongo1
Entered by: OUEDRAOGO Dieu-Donné
Abstract

Background Artificial intelligence (AI) tools, particularly Large Language Models (LLMs), are revolutionizing medical
practice, including rheumatology. However, their diagnostic capabilities remain underexplored in the African context.
This study aimed to assess the diagnostic accuracy of ChatGPT-4, Gemini, Copilot, and Claude AI in rheumatology
within an African population.
Methods This was a cross-sectional analytical study with retrospective data collection, conducted at the
Rheumatology Department of Bogodogo University Hospital Center (Burkina Faso) from January 1 to June 30, 2024.
Standardized clinical and paraclinical data from 103 patients were submitted to the four AI models. The diagnoses
proposed by the AIs were compared to expert-confirmed diagnoses established by a panel of senior rheumatologists.
Diagnostic accuracy, sensitivity, specificity, and predictive values were calculated for each AI model.
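
The per-model metrics named above (accuracy, sensitivity, specificity, and predictive values) can each be derived from a 2x2 confusion matrix against the expert reference standard. The following is a minimal Python sketch of those formulas; the function name and the example counts are illustrative assumptions, not data or code from the study.

def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, sensitivity, specificity, PPV and NPV for one AI model,
    from true/false positives and negatives versus the expert diagnosis."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }

# Illustrative counts only (not study data):
print(diagnostic_metrics(tp=45, fp=6, fn=4, tn=48))
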
Results Among the 103 patients enrolled during the study period, infectious diseases constituted the most common
diagnostic category, representing 47.57% (n = 49). ChatGPT-4 achieved the highest diagnostic accuracy (86.41%),
followed by Claude AI (85.44%), Copilot (75.73%), and Gemini (71.84%). The inter-model agreement was moderate,
with Cohen’s kappa coefficients ranging from 0.43 to 0.59. ChatGPT-4 and Claude AI demonstrated high sensitivity
(> 90%) for most conditions but had lower performance for neoplastic diseases (sensitivity < 67%). Patients under 50
years old had a significantly higher probability of receiving a correct diagnosis with Copilot (OR = 3.36; 95% CI [1.16–
9.71]; p = 0.025).
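
As context for the agreement and subgroup figures above, a Cohen's kappa between two models' case-level verdicts and an odds ratio with a 95% CI from a 2x2 table can be computed as in the Python sketch below. All labels and counts are illustrative assumptions; they do not reproduce the study's data or analysis code.

import math

def cohens_kappa(ratings_a: list, ratings_b: list) -> float:
    """Cohen's kappa between two raters over the same cases."""
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    expected = sum((ratings_a.count(c) / n) * (ratings_b.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1 - expected)

def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio and 95% CI (log-normal approximation) from a 2x2 table:
    a/b = outcome present/absent in the exposed group, c/d in the unexposed."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (or_,
            math.exp(math.log(or_) - z * se),
            math.exp(math.log(or_) + z * se))

# Illustrative use with made-up verdicts and counts:
print(cohens_kappa(["correct", "correct", "wrong"], ["correct", "wrong", "wrong"]))
print(odds_ratio_ci(a=40, b=12, c=25, d=26))
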
Conclusion LLMs, particularly ChatGPT-4 and Claude AI, show high diagnostic capabilities in rheumatology, despite
some limitations in specific disease categories.

Keywords

Artificial intelligence, Large Language Models, Rheumatology, Diagnostic accuracy, Africa
