NIH findings highlight risks and benefits associated with integrating AI into medical decisions
July 23, 2024
Press release
Tuesday, July 23, 2024
The AI model performed well on the medical diagnosis quiz but made mistakes in explaining the answers.
GPT-4V, an AI model, often made mistakes when describing medical images and explaining the reasoning behind its diagnosis, even when its final answer was correct. Credit: NIH/NLM
Researchers at the National Institutes of Health (NIH) found that an AI model solved, with high accuracy, quiz questions designed to assess healthcare professionals’ ability to diagnose patients based on clinical images and a short text summary. However, physicians evaluating the AI model found that it often made mistakes when describing the images and explaining the reasoning behind its answers. The findings, which shed light on the potential of AI in clinical settings, were published in npj Digital Medicine. The study was led by researchers from NIH’s National Library of Medicine (NLM) and Weill Cornell Medicine, New York City.
The findings emphasize that AI is not yet advanced enough to replace human expertise, which is essential for an accurate diagnosis.
The AI model and human physicians answered questions from the New England Journal of Medicine (NEJM) Image Challenge. The challenge is an online quiz that presents real clinical images with a short text description of the patient's symptoms and presentation, then asks the user to select the correct diagnosis from multiple-choice answers. The researchers asked the AI model to answer 207 image challenge questions and to provide a written rationale justifying each answer. The prompt specified that the rationale should include a description of the image, a summary of the medical knowledge relevant to the question, and a step-by-step explanation of how the AI model selected the answer.
The researchers recruited nine physicians from various institutions, each with a different medical specialty. The physicians answered their assigned questions in two ways: first in a “closed book” setting (without consulting external sources such as websites), and then in an “open book” setting (using online resources). The researchers then gave the physicians the correct answer, along with the AI model’s answer and rationale. Finally, the physicians scored the AI model on its ability to describe the image, summarize relevant medical knowledge, and provide a step-by-step explanation.
The researchers found that both the AI model and the physicians scored highly in choosing the correct diagnosis. Interestingly, the AI model selected the correct diagnosis more often than the physicians did in the closed-book setting, while physicians using open-book resources performed better than the AI model.
Importantly, the physician evaluators found that the AI model often made mistakes when describing the medical image and explaining the reasoning behind its diagnosis, even when it made the correct final choice. In one example, the AI model was given a photo of a patient’s arm with two lesions. A physician would easily recognize that both lesions were caused by the same condition. Because the lesions were shown at different angles, however, the AI model failed to recognize that both could be related to the same diagnosis.
The researchers say this research supports the need to further evaluate multimodal AI technologies before they are introduced into clinical settings.
Zhiyong Lu, Ph.D., principal investigator at NLM and corresponding author of the study, said, “This technology can help clinicians improve their capabilities with data-driven insights that could lead to better clinical decision-making.”
This study used an AI model called GPT-4V, which can combine multiple types of data, including images and text. The researchers note that this small study highlights the potential of multimodal AI to help doctors make medical decisions. More research is needed to compare the accuracy of these models to that of doctors in diagnosing patients.
Collaborators on the study came from Harvard Medical School and Massachusetts General Hospital; Case Western Reserve University School of Medicine, Cleveland; the University of California San Diego, La Jolla; and the University of Arkansas, Little Rock.
The National Library of Medicine is the world's largest biomedical library and a leader in research in biomedical informatics and data science. NLM conducts and supports research on methods for recording, storing, retrieving, preserving, and communicating health information. NLM provides resources and tools used by millions of people each year to access and analyze information on molecular biology, biotechnology, toxicology, and environmental health. More information can be found at https://www.nlm.nih.gov.
The National Institutes of Health: NIH, the nation's medical research agency, includes 27 institutes and centers and is part of the U.S. Department of Health and Human Services. For more information about NIH and its programs, visit www.nih.gov.
NIH…Turning Discovery Into Health®
Reference
Qiao Jin, et al. Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine. npj Digital Medicine. DOI: 10.1038/s41746-024-01185-7 (2024).
###