Signal failure: Inside Optus’ day from hell
Australia’s second-largest telco suffered the worst outage in recent memory this week, which took almost 16 hours to recover from and affected more than 10 million customers. Here’s what happened.
By David Swan and Ben Grubb
It was just after 4am on Wednesday that the day from hell began for Optus.
At about 7.45am, 3 hours and 45 minutes after the massive disruption began hitting its entire customer-facing network, chief executive Kelly Bayer Rosmarin arrived at the telco’s Macquarie Park network operations centre in Sydney’s north-west to confront the disaster recovery that lay ahead, according to an Optus source who did not wish to be identified due to the sensitive nature of the outage.
In an interview, Bayer Rosmarin did not dispute this, saying “nobody has questioned where I was - I was here onsite with the team”.
The Optus operations centre, or so-called “nerve centre”, is where staff monitor and maintain its network and where Bayer Rosmarin and other employees would spend the rest of the day and week investigating what had caused its network to collapse.
Numerous incidents were reported in the early hours of the morning, one of which was recorded at 5.19am: Optus’ own IT systems were crippled by the outage.
By 9am, staff were under the impression that one of Optus’ content distribution partners, Akamai, may have contributed to the outage, and a senior Optus executive called an Akamai counterpart trying to get them to fix the problem, the source said.
By Friday afternoon, Optus still suspected an external party triggered the issue, with engineers baffled by a chain of events they had never imagined. Akamai initially directed inquiries about its potential involvement in the outage to Optus, which did not comment.
Akamai has clarified that its technology was not responsible for the outage.
“We have no present indication that this incident is related to an issue with Akamai. We stand ready to support Optus and our partners at all times,” an Akamai spokesperson said on Saturday.
At 9.37am on Wednesday, Optus told partners privately that field technicians had been dispatched to address the issue at two crucial exchange locations: the Sunshine Exchange, in Melbourne’s west, and a separate one in Burwood East, in the city’s east.
“Upon initial assessment, it has been identified that troubleshooting of routers and router reflectors [a type of router] may be necessary to resolve the problem effectively,” it said. “We are still confirming the restoration details in terms of ETR [estimated time of restoration] and we will endeavour to provide that shortly.”
An earlier message at 7.45am said the issue lay with “route reflectors, which are currently handling an excessive number of routes, leading to session shutdown and a complete traffic halt”.
“Our on-site technician is actively prioritising establishing a console connection [a physical cable connection to a router]. Rest assured that said technician is also being provided additional technical support remotely.”
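The behaviour the message describes, sessions shutting down once they are handling too many routes, resembles a common safeguard known as a maximum-prefix limit. The sketch below is a toy illustration of that general idea only: the message does not say which mechanism Optus’ equipment used, and the class name, threshold and addresses are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BgpSession:
    """Toy model of one routing session guarded by a maximum-prefix limit."""
    peer: str
    max_prefixes: int  # hypothetical safety threshold, not a real Optus value
    routes: set = field(default_factory=set)
    established: bool = True

    def receive(self, prefix: str) -> None:
        """Accept one advertised route, or shut down once the limit is exceeded."""
        if not self.established:
            return  # a shut-down session ignores further updates
        self.routes.add(prefix)
        if len(self.routes) > self.max_prefixes:
            # Protective shutdown: drop everything learned from this peer,
            # so traffic that depended on those routes stops flowing.
            self.established = False
            self.routes.clear()


session = BgpSession(peer="upstream", max_prefixes=3)
for i in range(5):
    session.receive(f"10.0.{i}.0/24")

print(session.established, len(session.routes))  # False 0
```

In this toy model the fourth route pushes the session over its limit, the session closes itself and every route learned through it disappears, which is why a safeguard of this kind can halt traffic rather than merely slow it.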
Mobile phones of staff at the operations centre, as well as other Optus staff, had no signal – even the chief executive’s phone was unable to connect to mobile towers, forcing her to call ABC radio via WhatsApp just after 10.30am to publicly apologise to customers.
By about 5.50pm, Optus advised staff internally that a team from Nokia Networks, which manages its routers, had performed a “manual restart” of route reflectors across all sites and of the Border Gateway Protocol (the protocol through which network operators’ routers share routing information).
By Friday afternoon, two days after the outage, Optus had issued a “change freeze”, a directive that no internal IT systems be changed in any way until November 13, the source said. The move is an indication that Optus fears even a routine upgrade could trigger another disaster.
This week’s outage, which ultimately crippled large numbers of the nation’s businesses, hospitals and Victoria’s rail networks, has highlighted the vulnerability of Australia’s telecommunications networks – and may necessitate new laws to prevent a repeat – according to industry insiders.
Sam Pratt, the chief executive of infrastructure provider Render, tells this masthead that chronic underinvestment over the last decade has meant Australia’s economy effectively grinds to a halt whenever major network issues occur.
“Outages like what we’ve seen with Optus are the cost of not investing and further future-proofing our fibre networks, which serve as a crucial backbone for all telecommunications services, including 4G and 5G,” Pratt says.
“The rapid acceleration in consumer demand for bandwidth will continue to challenge wireless network operators who, despite significant network investment, are struggling to keep pace.”
While localised service disruptions are relatively common, a total national outage by one key provider can bring a large chunk of the country to a standstill. All three major operators have over the last decade suffered major outages: Telstra dealt with months of technical issues in 2018 while Vodafone earned itself a “Vodafail” tag in 2010 for its multiple network failures, a label that dogged the telco for years.
Vodafone lost more than two million customers between 2010 and 2013, when its reputation for generous add-ons and value was quickly replaced by one of call dropouts and patchy coverage.
“Optusfail” will have far more profound consequences, given the critical functions – payments, transport infrastructure, hospitals – that have migrated to mobile networks in the last decade.
Ten years ago, Australians could easily go to a bank branch or an ATM to access banking services in the event of a telco outage. Now there are fewer ways to get cash, less cash is being used, and the nation’s banking, energy, electricity, transport and health systems all rely on telecommunications networks, which are often susceptible to a single point of failure.
This week, businesses across Australia faced disruptions as payment systems froze, while any applications requiring two-factor authentication or text message verifications, like banking apps, were also hamstrung.
“Critical systems need to have some form of redundancy,” Jane MacMaster, chief engineer at Engineers Australia, told this masthead. “The national electricity systems have redundancy built-in, they’re required by legislation to have that. Hospitals also have built-in redundancies with some of their equipment that can’t afford to go down.
“That’s the question we should all be asking: do our critical systems have the appropriate redundancy or some other mechanism for ensuring that the single points of failure are adequately managed?”
Narelle Clark, who worked in various senior roles at Optus between 1998 and 2008 and is now chief executive of the Internet Association of Australia, which provides an internet exchange service to some of Australia’s largest telcos, said Optus should have had a way to remotely connect to the routers inside its data centres that was separate from its own network. This could have been done via SIM cards from other networks.
“All of mine have two SIMs into them,” Clark said. “One is Optus and the other Telstra.”
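The arrangement Clark describes, a management link that does not depend on the operator’s own network, can be sketched as a simple ordered fallback. This is a minimal illustration under stated assumptions rather than a description of any real Optus or Internet Association of Australia system: the function name, path names and health checks are all hypothetical.

```python
from typing import Callable, Optional


def pick_management_path(
    paths: list[tuple[str, Callable[[], bool]]],
) -> Optional[str]:
    """Return the name of the first management path whose health check passes."""
    for name, is_up in paths:
        if is_up():
            return name
    return None  # no remote path works; someone has to attend in person


# Hypothetical scenario: the operator's own network and the SIM on that
# same network are both down, but the second carrier's SIM still connects.
paths = [
    ("own-network", lambda: False),
    ("sim-carrier-a", lambda: False),
    ("sim-carrier-b", lambda: True),
]

print(pick_management_path(paths))  # sim-carrier-b
```

The point of the second SIM is that its failure is independent of the first: engineers can still reach the routers remotely even when the network they are trying to repair is the one that is down.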
Clark said it appeared that Optus had inadequate network segmentation, which would have mitigated some of the outage. While in the past fixed-line and mobile networks were completely separate, they were now converged, she said. Even so, modern networks still needed some form of logical, regional or functional separation to contain faults and assist in staged restoration.
Vodafone is not the only telco to serve as an object lesson for Optus executives in learning from failure.
Last year, Canadian telco Rogers suffered an outage affecting 12 million users, with around 25 per cent of Canada losing internet connectivity for about 15 hours. Industry experts say the cause of the outage – a faulty maintenance upgrade that caused a router to malfunction – is potentially what felled Optus’ network.
The Canadian government moved quickly, passing laws – with industry agreement – requiring the nation’s telcos to provide mutual assistance in the event of an outage, and emergency roaming services to rivals’ affected customers. New guidelines were also established for telcos about how to communicate to the public during outages.
There are now calls for Australia to look at passing similar laws, and all eyes are on the Australian federal government to see how it reacts to the incident.
Communications Minister Michelle Rowland has already called a post-incident inquiry into the outage, while the Senate and communications watchdog the Australian Communications and Media Authority will also probe Optus’ handling of the incident. At least two state governments, South Australia and Victoria, have also said they will review their contracts with the telco.
MacMaster says the government should consider mandating capacity sharing in some instances as a way to protect against a single point of failure.
Rowland wouldn’t be drawn on whether wide-reaching mutual assistance laws like those passed in Canada were needed here, but said she’s focused for now on making sure emergency roaming would be possible during natural disasters. The fact that some Optus customers were unable to access triple zero particularly rankled.
“The nationwide Optus outage was incredibly distressing for millions of Australians,” Rowland told this masthead.
“We’ve announced that the federal government will commence a post-incident review to ensure we understand what happened, what went wrong and what improvements can be made by the industry.”
Last month, Rowland and Emergency Management Minister Murray Watt tasked the Department of Communications and the National Emergency Management Agency with scoping work on the development of a temporary emergency mobile capability to be activated during natural disasters.
Such a move would allow Australians to connect to any available mobile network during bushfires, floods and other emergencies.
“This will require co-operation between mobile network carriers, but comes following the ACCC finding the capability was technically feasible,” Rowland said.
“We’re continuing to monitor this space to ensure we have the right settings to support Australians [to] stay connected during disasters.”
Some say new laws should go further to prevent an outage of the magnitude of the Optus collapse from happening again.
“The lack of performance-related telecommunications regulations in Australia is to some extent a contributing factor that led to the Optus national outage,” says RMIT associate professor Dr Mark Gregory.
“There is a need for government to legislate to prevent a reoccurrence of Optus’s national outage. The loss of the triple zero emergency call service, even for one day, should not have happened, and it means the Optus network is not fit for purpose.
“The Optus national outage provides a strong indication that Optus has underinvested in the engineering, infrastructure and systems required to ensure that its network is robust and fit for purpose.”
But David Thodey, the former chief executive of Telstra, said he was doubtful that regulation would have any material impact on outages. There’s some acceptance among industry figures that outages are simply a fact of life, even with regulation.
“I know that no telco would ever do anything to [deliberately] compromise their network ... So it is difficult to see what further regulation would do that would practically change any behaviour,” he said.
The government is moving quickly but so too are Optus’ competitors.
Customers lined up at Telstra and Vodafone stores during the week, having run out of patience with Optus after hackers stole their information a year ago. Many are unhappy with the telco’s offer of 200 gigabytes of free data, which has been widely seen as underwhelming. One Optus customer took to the task platform Airtasker, offering $150 for someone to buy a SIM card from a rival on her behalf.
Statistics from IBISWorld show Telstra commands a 36.5 per cent market share locally, with Optus holding about 18 per cent.
TPG, the parent company of Vodafone, said that in the 24 hours after the outage it saw a more than 400 per cent increase in sales across all of its brands, including Vodafone, TPG, iiNet, Felix and Lebara, in what was its busiest sales day of the past year.
Smaller rivals too are enjoying their moment in the sun.
“This is a wake-up call that you don’t need to be with the same company you’ve always been with, and it’s a great time to shop around,” says Jason Haynes, general manager at Boost Mobile.
Bayer Rosmarin is choosing to stay optimistic, hoping customers will stick around. “Our message to customers every day is that Optus is a company that’s a real customer champion,” she says.
“We strive every day to give our customers the best possible value for money, a great network experience and unique features that they can’t get anywhere else, and we will continue to do that day in and day out.
“Today was a bad day but every other day we deliver on that promise for our customers.”
clarification
This article has been updated since publication to include a response from Akamai provided on Saturday. According to Akamai, its technology was not responsible for the outage.