OpenAI’s ChatGPT is no closer to replacing your family physician: the increasingly advanced chatbot failed to accurately diagnose the vast majority of hypothetical pediatric cases.
The findings were part of a new study published in JAMA Pediatrics on Jan. 2, conducted by researchers from Cohen Children’s Medical Center in New York. The researchers analyzed the bot’s responses to requests to diagnose childhood illnesses and found that the bot had an 83 percent error rate across tests.
The study used what are known as pediatric case challenges, or medical cases originally posted to groups of physicians as learning opportunities (or diagnostic challenges) involving unusual or limited information. Researchers sampled 100 challenges published in JAMA Pediatrics and NEJM between 2013 and 2023.
ChatGPT provided incorrect diagnoses for 72 of the 100 cases, and generated another 11 answers that were deemed “clinically related” to the correct diagnosis but too broad to be considered correct. Together, those 83 misses account for the 83 percent error rate.
The researchers attribute part of this failure to the generative AI’s inability to recognize relationships between certain conditions and external or preexisting circumstances, which physicians often use to help diagnose patients in a clinical setting. For example, ChatGPT did not connect “neuropsychiatric conditions” (such as autism) to commonly seen cases of vitamin deficiency and other restrictive-diet-based conditions.
The study concludes that ChatGPT needs continued training and the involvement of medical professionals, so that the AI is fed not an internet-generated well of information, which can often cycle in misinformation, but vetted medical literature and expertise.
AI-based chatbots relying on large language models (LLMs) have previously been studied for their efficacy in diagnosing medical cases and in accomplishing the daily tasks of physicians. Last year, researchers tested generative AI’s ability to pass the three-part United States Medical Licensing Exam. It passed.
But while AI is still highly criticized for its training limits and potential to exacerbate medical bias, many medical groups, including the American Medical Association, don’t view its advancement in the field solely as a threat of replacement. Instead, better-trained AIs are considered ripe for administrative and communicative uses, like generating patient-side text, explaining diagnoses in common terms, or writing instructions. Clinical uses, like diagnostics, remain a controversial and hard-to-research topic.
To that end, the new report represents the first analysis of a chatbot’s diagnostic potential in a purely pediatric setting, one that acknowledges the specialized medical training undertaken by medical professionals. ChatGPT’s current limitations show that even the most advanced chatbot on the public market can’t yet compete with the full range of human expertise.