Consumer AI chatbots falter when used to make medical diagnoses, particularly when faced with incomplete information, according to new research highlighting the risks of relying on them as digital doctors. The results were published in the Financial Times.
The study finds that leading large language models struggle to suggest a range of possible diagnoses when patient data is limited, frequently narrowing too quickly to a single answer.
The results point to a broader limitation in AI: While chatbots can identify likely conditions once a case is fully specified, they are less reliable at the earlier, more uncertain stages of clinical reasoning.
The findings highlight the dangers of relying on the technology alone to pinpoint health problems, particularly in cases where the data that users input may be vague or patchy.
“These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn’t much information,” said Arya Rao, the study’s lead author and a researcher at the Massachusetts-based Mass General Brigham healthcare system.
The researchers evaluated 21 LLMs, including leading models from OpenAI, Anthropic, Google, xAI and DeepSeek. They found that failure rates exceeded 80 percent for every model on so-called differential diagnosis, the task of listing a range of possible conditions when full patient information is lacking.
Failure rates fell below 40 percent for final diagnoses made with more complete data, and the best performers exceeded 90 percent accuracy.