NewsBite

How a quick-thinking IT team saved 1000 PCs from ‘blue screen of death’

How the IT team at an HR firm responsible for the payments, rostering and management of about 2.5 million people contained the damage to its systems during the CrowdStrike outage.

It was about 3pm on Friday when Kelvin Yip first knew there was a problem.

The head of information security, who runs a team of 20 people, watched as Employment Hero’s Slack chat began to blow up.

One by one, staff members were asking what was going on, sharing screenshots of what eventually became known as the “Blue Screen of Death”.

What Mr Yip and the rest of the world would soon find out was that a routine update from their trusted supplier, CrowdStrike, had effectively brought Employment Hero and thousands of other companies to their knees, grounding flights, leaving banks unable to transact and leaving thousands of businesses, including major supermarkets, unable to take payments.

Jetstar check-in at Brisbane Airport after the CrowdStrike global IT outage on Saturday. Picture: Richard Walker

Behind the outage was US-based CrowdStrike, one of the world’s most valuable endpoint cyber security providers, which had botched an update that would cripple Microsoft Windows computers the world over.

Mr Yip’s team convened immediately on a call.

Determined not to waste a crisis, Mr Yip has revealed the lesson his team learned to reduce the risk of another mass disruption.

Within Employment Hero, which provides human resources, payroll and employment management services to more than 300,000 small to medium businesses, there are two teams who respond to such incidents.

One is in IT, the other is a security team. Both work hand-in-hand in the event of an outage, breach or other software malfunction, Mr Yip said.

“Our IT and security teams are well-versed in our disaster response playbooks, which are regularly tested to ensure preparedness,” he said.

“But, in this scenario, due to the number of users affected, it was an ‘all-hands-on-deck’ approach to contain the issue and restore functionality as quickly as possible.”

As dozens more screens went blue for Employment Hero staff, leaving them unable to work, a bigger problem was looming.

The Blue Screen of Death. Picture: Supplied

“We also knew that we were against the clock as our team in the UK were about to power on their computers in less than 2 hours, so we had to contain the issue before then.”

Employment Hero has 1254 staff, and its software is used to manage more than 2.5 million employees globally. An outage could severely disrupt staff rosters, limit planning and, in some cases, prevent workers from being paid.

“There was lots of facepalming which then turned into adrenaline as we realised that the number of affected users was steadily increasing,” Mr Yip said.

Employment Hero, like many of the world’s largest companies, is a CrowdStrike customer. A quick call to its CrowdStrike account manager neither solved the problem nor explained what the issue was. The team was told only that a fix would eventually be on the way.

But by this point, hundreds of the company’s own computers were turning blue, people had begun to speculate online and news outlets had caught on.

Employment Hero head of information security Kelvin Yip.

That’s when Employment Hero’s response team worked out they could shield the computers that had not yet been affected.

But it was risky. “We were cautious because we wanted to be sure that we weren’t making matters worse,” Mr Yip said.

“We used test devices first, so we were confident before applying the fix to the rest of the computers.”

This was at 4:30pm, about 90 minutes after the company’s computers first began to show the Blue Screen of Death.

The company rolled out the script.

“To mitigate further impact, we promptly executed a script remotely using Intune, our endpoint management tool, to protect any computers that had not yet been affected,” Mr Yip said.

“This was especially important because our team in the UK would start their business day about two hours from when we first identified the problem.

“By proactively executing the script, we prevented a substantial increase in the number of devices requiring repairs.”

The Terry White Chemist at the Gasworks in Brisbane closed as its staff were unable to use its systems during the CrowdStrike outage. Picture: David Clark

As the adrenaline began to build, Mr Yip said there was one thing in the back of his mind giving him comfort.

“Our critical files and data are backed up regularly and securely. This ensures that, in the unfortunate scenario where the computers are beyond recovery, our data remains secure and accessible,” he said.

“Our backup systems are an integral part of our disaster recovery plan, designed to ensure business continuity and minimise downtime not only for Employment Hero, but for the 300,000 SMEs and the 2.5 million employees globally that rely on us for payroll and HR.”

The fix worked. A total of 267 devices at Employment Hero were affected by the outage, but several hundred more would also have been rendered useless had it not been for the company’s security and IT teams, Mr Yip said.

But there was still the issue of the 267 computers that had been left temporarily unusable by CrowdStrike’s update.

“From there, our IT team spent the next 2 days helping the rest of the business restore functionality,” Mr Yip said.

The blue screen seen at a bus stop in the US. Picture: Justin Sullivan/Getty Images/AFP

While the outage was alarming, Mr Yip said he was proud to see the team’s drills and disaster recovery planning pay off.

“This rapid mobilisation is a testament to our agile and proactive company culture, which is a core aspect of doing things the EH way,” he said.

Mr Yip said because Employment Hero is a remote-first company, its staff were used to collaborating at speed from all over the country to troubleshoot and solve issues.

“Our instant communication and co-ordination is facilitated by our remote-first infrastructure, which ensures that all necessary information is available and accessible to key team members without any delay,” he said.

Joseph Lam, Reporter

Joseph Lam is a technology and property reporter at The Australian. He joined the national daily in 2019 after he cut his teeth as a freelancer across publications in Australia, Hong Kong and Thailand.

Original URL: https://www.theaustralian.com.au/business/technology/how-a-quick-thinking-it-team-saved-1000-pcs-from-blue-screen-of-death/news-story/06db43eda8d947363a18f6772b23f9ed