‘Complete collapse’: Bombshell report into AI accuracy indicates your job is probably safe – for now
Apple has dropped a damning report into new cutting-edge AI technology – and it spells bad news for the industry.
The latest form of cutting-edge artificial intelligence technology suffers from “fundamental limitations” that result in a “complete accuracy collapse”, a bombshell report from Apple has revealed.
Researchers from the tech giant have published a paper with their findings, which cast doubt on the true potential of AI as billions of dollars are poured into developing and rolling out new systems.
The team put large reasoning models, an advanced form of AI used in platforms like DeepSeek and Claude, through a series of puzzle challenges ranging from simple to complex. They also tested large language models, the technology platforms like ChatGPT are built on.
Large language models fared better than large reasoning models on fairly standard tasks, but both fell flat when confronted with more complex challenges, the paper revealed.
Researchers also found that large reasoning models began “reducing their reasoning effort” as they struggled to perform, which was “particularly concerning”.
“Upon approaching a critical threshold – which closely corresponds to their accuracy collapse point – models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty,” the paper read.
The findings suggest that the advancement of AI, at least under current approaches, may have reached its limit for now.
Niusha Shafiabady, an associate professor of computational intelligence at Australian Catholic University and director of the Women in AI for Social Good lab, said “expecting AI to be a magic wand” is a mistake.
“I have been talking about the realistic expectations about the AI models since 2024,” Dr Shafiabady said.
“When AI models face countless interactions with the world, it is not possible to investigate and control every single problem that could happen. That is why things could get out of hand or out of control.”
Gary Marcus, a leading voice on AI and the author of six books, delivered a savage analysis of the Apple paper on his popular Substack, describing it as “pretty devastating”.
“Anybody who thinks [large language models] are a direct route to the [artificial general intelligence] that could fundamentally transform society for the good is kidding themselves,” Dr Marcus wrote.
Dr Marcus then took to X to declare that the hype around AI has become “a giant game of bait and switch”.
“The bait: we are going to make an AI that can solve any problem an expert human could solve. It’s gonna transform the whole world,” Dr Marcus wrote.
“The switch: what we have actually made is fun and kind of amazing in its own way but rarely reliable and often makes mistakes – but ordinary people make mistakes too.”
In the wake of the paper’s release, Dr Marcus has re-shared passionate defences of AI posted to X by evangelists excusing the accuracy flaws it exposed.
“Imagine if calculator designers made a calculator that worked 80 per cent correctly and said ‘naah, it’s fine, people make mistakes too’,” Dr Marcus quipped.
Questions about the quality of large language and large reasoning models aren’t new.
For example, when OpenAI released its new o3 and o4-mini models in April, it described them as its “smartest and most capable” yet, trained to “think for longer before responding”.
“The combined power of state-of-the-art reasoning with full tool access translates into significantly stronger performance across academic benchmarks and real-world tasks, setting a new standard in both intelligence and usefulness,” the company’s announcement read.
But testing by the prestigious Massachusetts Institute of Technology (MIT) revealed the o3 model was incorrect 51 per cent of the time, while o4-mini performed even worse, with an error rate of 79 per cent.
Truth and accuracy undermined
Apple recently suspended its news alert feature on iPhones, powered by AI, after users reported significant accuracy errors.
Among the jaw-dropping mistakes were alerts that tennis icon Rafael Nadal had come out as gay, that alleged UnitedHealthcare CEO shooter Luigi Mangione had died by suicide in prison, and that a winner had been crowned at the World Darts Championship hours before competition began.
Research conducted by the BBC found a litany of errors across other AI assistants providing information about news events, including Google’s Gemini, OpenAI’s ChatGPT and Microsoft’s Copilot.
It found 51 per cent of all AI-generated answers to queries about the news had “significant issues” of some form. When looking specifically at how its own news coverage was being represented, the BBC found 19 per cent of answers citing its content were factually incorrect.
And in 13 per cent of cases, quotes said to be contained within BBC stories had either been altered or entirely fabricated.
Meanwhile, a newspaper in Chicago was left red-faced recently after it published a summer reading list featuring multiple books that don’t exist, because the story copy had been produced by AI.
And last year, hundreds of people who lined the streets of Dublin were disappointed when it turned out the Halloween parade advertised on an events website had been invented.
Google was among the first of the tech giants to roll out AI summaries of search results, relying on a large language model – with some hilarious and possibly dangerous outcomes.
Among them were suggestions to add glue to pizza, eat a rock a day to maintain health, take a bath with a toaster to cope with stress, drink two litres of urine to help pass kidney stones and chew tobacco to reduce the risk of cancer.
Jobs might be safe – for now
Ongoing issues with accuracy might have some companies thinking twice about going all-in on AI when it comes to substituting their workforces.
So too might some recent examples of the pitfalls of people being replaced with computers.
Buy now, pay later platform Klarna shed more than 1000 people from its global workforce as part of a dramatic shift to AI, sparked by the partnership it forged with OpenAI in 2023.
But last month, the Swedish firm conceded its strong reliance on AI customer service chatbots – which saw its employee count almost halved in two years – had created quality issues and led to a slump in customer satisfaction.
Realising most customers prefer interacting with a human, Klarna has begun hiring back actual workers.
Software company Anysphere faced a customer backlash in April when its AI-powered support chatbot went rogue, kicking users out of the code-editing platform Cursor and delivering incorrect information.
It then seemingly invented a new user policy out of thin air to justify the logouts: that the platform couldn’t be used across multiple computers. Cursor saw a flood of customer cancellations as a result.
AI adviser and former Google chief decision scientist Cassie Kozyrkov took to LinkedIn to share her thoughts on the saga, dubbing it a “viral hot mess”.
“It failed to tell users that its customer support ‘person’ Sam is actually a hallucinating bot,” Ms Kozyrkov wrote. “It’s only going to get worse with AI agents.”
Many companies pushing AI insist the technology is improving swiftly, but a host of experts aren’t convinced its hype matches its ability.
Earlier this year, the Association for the Advancement of Artificial Intelligence surveyed two dozen AI specialists and some 400 of the group’s members and found a surprising level of pessimism about the potential of the technology.
Sixty per cent of those surveyed did not believe problems with factuality and trustworthiness “would soon be solved”, it found.
Issues of accuracy and reliability are important, not just for growing public trust in AI, but for preventing unintended consequences in the future, AAAI president Francesca Rossi wrote in a report about the survey.
“We all need to work together to advance AI in a responsible way, to make sure that technological progress supports the progress of humanity and is aligned to human values,” Ms Rossi said.
Projects stalled or abandoned
Embarrassing and potentially costly issues like these are contributing to a backtrack, with analysis by S&P Global Market Intelligence showing the share of American and European companies abandoning their AI initiatives rising to 42 per cent this year from 17 per cent in 2024.
And a study released last month by consulting firm Roland Berger found a mammoth investment in AI technology wasn’t translating to useful outcomes for many businesses.
Spending on AI by corporates in Europe hit an estimated US$14 billion (AU$21.4 billion) in 2024, but just 27 per cent of the companies surveyed were able to fully integrate the technology into their operations or workflows, the research revealed.
“Asked about the key challenges involved in implementing AI projects, 28 per cent of respondents cited issues with data, 25 per cent referenced the complexity of integrating AI use cases, and 15 per cent mentioned the difficulty of finding enough AI and data experts,” the study found.
Those findings were mirrored in an IBM survey, which found only one in four AI projects delivered the returns they promised.
Dr Shafiabady said there are a few reasons for problems facing AI, like those identified in Apple’s research.
“When dealing with highly complex problems, these types of complex AI models can’t give an accurate solution. One of the reasons why is the innate nature of algorithms,” Dr Shafiabady said.
“Models are built on mathematical computational iterative algorithms that are coded into computers to be processed. When tasks get very complicated, these algorithms won’t necessarily follow the logical reasoning and will lose track of them.
“Sometimes when the problem gets harder, all the computing power and time in the world won’t enhance the AI model’s performance. Sometimes when it hits very difficult tasks, it fails because it has learnt the example rather than the hidden patterns in the data.
“And sometimes the problem gets complicated, and a lot of computation resource and time is wasted over exploring the wrong solutions and there is not enough ‘energy’ left to reach the right solution.”
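Her point about models learning the example rather than the hidden pattern is what researchers call overfitting. As a rough illustration only – a toy Python sketch of the general idea, not anything from the Apple paper – compare a “model” that simply memorises its training pairs with one that estimates the underlying rule:

```python
# Toy illustration of "learning the example rather than the hidden pattern".
# The hidden rule here is y = 2x. A memorising "model" stores the training
# pairs verbatim; a generalising one estimates the underlying slope instead.

train = {1: 2, 2: 4, 3: 6}   # examples generated by the hidden rule y = 2x
unseen = 10                  # a harder input the model never saw in training

def memoriser(x):
    # Perfect recall on seen examples, nothing at all beyond them.
    return train.get(x)      # returns None for any input outside the examples

# Generalisation: recover the pattern (the slope) from the same data.
slope = sum(y / x for x, y in train.items()) / len(train)

print(memoriser(2))          # 4    -> looks flawless on a seen example
print(memoriser(unseen))     # None -> collapses once the task moves past the examples
print(slope * unseen)        # 20.0 -> the learned pattern still applies
```

The memoriser looks flawless on everything it has already seen, much like a model acing simple benchmark tasks, but it has no answer the moment a problem steps outside its examples – while the recovered pattern still holds.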
Originally published as ‘Complete collapse’: Bombshell report into AI accuracy indicates your job is probably safe – for now