If you asked a doctor whether to use ice to treat a burn, they would quickly advise you to run it under cold water instead. Even “Dr Google” will tell you that extreme cold constricts the blood vessels and can make a burn worse.
But what happens when you ask ChatGPT the same question? The chatbot will tell you it’s fine to use ice – so long as you wrap it in a towel.
The question is one of a hundred common health queries that Australian researchers used to test the chatbot’s ability to provide medical advice.
They found the software was fairly accurate when asked to provide a yes or no answer, but became less reliable when given more information – answering some questions with just 28 per cent accuracy.
Co-author Dr Bevan Koopman, CSIRO principal research scientist and associate professor at the University of Queensland, has spent years looking at how search engines are used in healthcare.
He said people were increasingly using tools such as ChatGPT for medical advice despite the well-documented pitfalls of seeking health information online.
“These models have come on to the scene so quickly ... but there isn’t really the understanding of how well they perform and how best to deploy them,” he said. “In the end, you want reliable medical advice … and these models are not at all appropriate for doing things like diagnosis.”
The study compared ChatGPT’s responses with known correct answers for a set of questions developed to test the accuracy of search engines such as Google.
It answered correctly 80 per cent of the time when asked to give a yes or no answer. But when provided with supporting evidence in the prompt, accuracy was reduced to 63 per cent, and fell to 28 per cent when an “unsure” answer was allowed.
Inverting the prompts to frame the question as a negative also reduced the accuracy of its answers – from 80 per cent to 56 per cent for the yes/no option, and from 33 per cent to just 4 per cent when it was given a third option of “unsure”.
Koopman said large language models such as ChatGPT were only as good as the information they were trained on, and hoped the study would provide a stepping stone for the next generation of health-specific tools “that would be much more effective”.
A national road map for artificial intelligence (AI) in healthcare, released last year, recommended the government “urgently communicate the need for caution” when using generative AI that is untested and unregulated in healthcare settings.
Professor Enrico Coiera, the director of Macquarie University’s Centre for Health Informatics and one of the authors of the road map, said some doctors were using large language models to help them take patient notes and write letters, but these had so far avoided the regulation and testing hurdles that every other health technology has to go through.
“In Silicon Valley they say, ‘move fast and break things’. That’s not a good mantra in healthcare where the things you might break are people,” he said.
Large language models construct sentences by assessing a huge database of words and how often they appear next to each other. They are chatty and easy to use but “don’t know anything about medicine”, Coiera said, and therefore should be supported by another kind of AI that can better answer health-related questions.
Dr Rob Hosking, a GP and the chairman of the Royal Australian College of General Practitioners’ technology committee, said there was a place for large language models in healthcare “if it’s trained on medical quality data, and supervised by a clinician who knows how to understand the data”.
“It’s really no different from our perspective – people come in with information they’ve got from friends, family or the internet,” he said. “It’s a bit like the move from using pen and paper to using a word processor – it’s a tool. We can’t take it as gospel.”