Current AI models struggle to diagnose genetic disorders from patients' written descriptions
August 14, 2024
Press release
Wednesday, August 14, 2024
NIH researchers have found that large language models rely heavily on concise, textbook-like language when evaluating medical questions.
Ernesto del Aguila III, National Human Genome Research Institute
Researchers at the National Institutes of Health have found that while AI tools can accurately diagnose genetic disorders based on textbook-like descriptions, they perform significantly worse when analyzing patients' own summaries of their health. The findings, reported in the American Journal of Human Genetics, demonstrate the need to improve these AI tools before they are applied in health care settings to help make diagnoses and answer patients' questions.
The researchers studied a type of AI known as large language models, which are trained on massive amounts of text-based data. These models are useful in the medical field because they can analyze and answer questions and often have user-friendly interfaces.
Dr. Ben Solomon, lead author of the study and clinical director of the NIH's National Human Genome Research Institute (NHGRI), noted that medicine relies heavily on words: electronic medical records and conversations between doctors and patients, for example, all consist of words. He added that the field of AI has made great strides with large language models and that being able to use words for clinical purposes could have a profound impact.
The researchers tested 10 different large language models, including the two most recent versions of ChatGPT. Drawing on medical textbooks and other reference materials, they created questions about a range of genetic conditions. These included well-known conditions, such as sickle cell disease, cystic fibrosis and Marfan syndrome, as well as many rare conditions.
The researchers selected three to five symptoms for each condition and then created questions phrased in a standard format: “I have symptoms X, Y, and Z. Which genetic condition is most likely?”
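The standardized question format described above can be sketched as a small template function. This is an illustration only, not the researchers' code: the function name `build_question` and the placeholder symptoms are assumptions, and only the sentence template and the three-to-five symptom range come from the article.

```python
def build_question(symptoms):
    """Format three to five symptoms into the standardized question.

    Hypothetical helper: the name and validation are illustrative assumptions.
    """
    if not 3 <= len(symptoms) <= 5:
        raise ValueError("expected three to five symptoms")
    # Join as "X, Y, and Z" to match the phrasing quoted in the article.
    listed = ", ".join(symptoms[:-1]) + ", and " + symptoms[-1]
    return f"I have symptoms {listed}. Which genetic condition is most likely?"


# Placeholder symptoms, as in the article's own example.
print(build_question(["X", "Y", "Z"]))
# prints: I have symptoms X, Y, and Z. Which genetic condition is most likely?
```

Keeping the wording fixed means that differences in model answers can be attributed to the symptoms rather than to the phrasing of the question.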
The large language models varied widely in their ability to identify the correct genetic diagnosis, with initial accuracy ranging from 21% to 90%. The best-performing model was GPT-4, one of the latest versions of ChatGPT.
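As a rough illustration of what an accuracy figure like those above means, the sketch below computes the fraction of questions for which a model named the expected condition. The article does not describe how answers were scored, so the case-insensitive exact match used here is an assumption, as are the function name `accuracy` and the made-up model answers.

```python
def accuracy(predictions, answers):
    """Return the fraction of predictions that match the expected condition.

    Assumption: a prediction counts as correct on a case-insensitive exact
    match. The study's actual scoring method is not described in the article.
    """
    if not answers or len(predictions) != len(answers):
        raise ValueError("need equally sized, non-empty lists")
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Made-up model answers for three conditions named in the article.
expected = ["Marfan syndrome", "Cystic fibrosis", "Sickle cell disease"]
predicted = ["marfan syndrome", "Cystic fibrosis", "unknown"]
print(f"{accuracy(predicted, expected):.0%}")  # two of three correct
```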
The models' success generally correlated with their size, meaning the amount of data they were trained on. The smallest models have several billion parameters, while the largest have more than a trillion. The researchers were able to improve the accuracy of many of the weaker models over time. Overall, the models still performed better than non-AI technology, such as a standard Google search.
The researchers optimized and tested the models in several ways, including replacing medical terms with more familiar language. For example, instead of saying a child is “macrocephalic,” the question would say the child has “a big head,” to better reflect how parents or caregivers would describe the symptom.
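The substitution of familiar language for medical terms can be sketched as a glossary lookup. This is a minimal illustration under stated assumptions: the glossary `LAY_TERMS` and the function `to_lay_language` are hypothetical, and the single entry is adapted from the article's “macrocephalic” example. The article does not say how the researchers rewrote their questions.

```python
import re

# Hypothetical glossary; the one entry is adapted from the article's example.
LAY_TERMS = {"is macrocephalic": "has a big head"}


def to_lay_language(text, glossary=LAY_TERMS):
    """Replace each medical phrase in the glossary with its everyday wording."""
    for term, plain in glossary.items():
        pattern = rf"\b{re.escape(term)}\b"
        # A function replacement avoids backslash interpretation in `plain`.
        text = re.sub(pattern, lambda _m, plain=plain: plain, text,
                      flags=re.IGNORECASE)
    return text


print(to_lay_language("My child is macrocephalic."))
# prints: My child has a big head.
```

A fixed glossary like this only covers terms someone has listed in advance, which is one reason free-form patient descriptions are harder to handle than reworded textbook questions.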
Overall, the models' accuracy decreased when medical terms were removed. However, even when common language was used, 7 out of 10 models were still more accurate than Google searches.
Kendall Flaharty, an NHGRI post-baccalaureate fellow who led the study, said that there are very few clinical geneticists in the world and that people in some states and countries have no access to them. AI tools could help people get answers to some of their questions without having to wait years.
To test the models with information from real patients, the researchers asked patients at the NIH Clinical Center to write short descriptions of their genetic conditions and symptoms. These descriptions ranged from one sentence to several paragraphs and varied more in style and content than the textbook-like questions.
When presented with descriptions from real patients, the best-performing model made correct diagnoses only 21% of the time. Some models performed even worse, with accuracy as low as 1%.
The researchers expected the patient-written summaries to be more challenging because many patients at the NIH Clinical Center have rare diseases. The models may therefore not have enough information about these conditions to make diagnoses.
However, the models' accuracy improved when the researchers asked standardized questions about the same ultra-rare genetic disorders found in the NIH patient population. This suggests that the models struggled to interpret the varied formats and wording of the patient descriptions, perhaps because they are trained on textbooks and other reference materials that tend to be more concise and standardized.
Dr. Solomon said that for these models to become clinically useful tools, more data are needed, and those data need to reflect the diversity of the patient population. The data must represent not only all medical conditions, he said, but also variation in race, age, gender, culture and other factors, so that they capture patients' diverse experiences. The models will then be able to learn how different people talk about their conditions.
Beyond identifying areas for improvement, this study highlights the current limitations of large language models and the continued need for human oversight when AI is used in health care.
“These technologies have already been implemented in the clinical setting,” Dr. Solomon added. The bigger question, he said, is not whether or how doctors will use AI, but where, when and why they should.
About the National Human Genome Research Institute (NHGRI): NHGRI is one of the 27 institutes and centers at the NIH, an agency of the Department of Health and Human Services. The NHGRI Division of Intramural Research develops and implements technology to understand, diagnose and treat genomic diseases. Additional information about NHGRI can be found at https://www.genome.gov/.
About the National Institutes of Health (NIH): NIH, the nation's medical research agency, includes 27 institutes and centers and is a component of the U.S. Department of Health and Human Services. NIH conducts and supports basic, clinical and translational medical research, and investigates the causes, treatments and cures for both common and rare diseases. For more information about NIH and its programs, visit www.nih.gov.
NIH…Turning Discovery Into Health®
###