The CrowdStrike fail and next global IT meltdown already in the making
When computer screens went blue worldwide on Friday, flights were grounded, hotel check-ins became impossible, and freight deliveries were brought to a standstill. Businesses resorted to paper and pen. And initial suspicions landed on some sort of cyberterrorist attack. The reality, however, was much more mundane: a botched software update from the cybersecurity company CrowdStrike.
“In this case, it was a content update,” said Nick Hyatt, director of threat intelligence at security firm Blackpoint Cyber.
And because CrowdStrike has such a broad base of customers, it was the content update felt around the world.
“One mistake has had catastrophic results. This is a great example of how closely tied to IT our modern society is — from coffee shops to hospitals to airports, a mistake like this has massive ramifications,” Hyatt said.
In this case, the content update was tied to the CrowdStrike Falcon monitoring software. Falcon, Hyatt says, has deep connections to monitor for malware and other malicious behavior on endpoints, in this case, laptops, desktops, and servers. Falcon updates itself automatically to account for new threats.
“Buggy code was rolled out via the auto-update feature, and, well, here we are,” Hyatt said. Auto-update capability is standard in many software applications, and isn’t unique to CrowdStrike. “It’s just that due to what CrowdStrike does, the fallout here is catastrophic,” Hyatt added.
Even though CrowdStrike quickly identified the problem, and many systems were back up and running within hours, the global cascade of damage isn’t easily reversed for organizations with complex systems.
“We think three to five days before things are resolved,” said Eric O’Neill, a former FBI counterterrorism and counterintelligence operative and cybersecurity expert. “This is a bunch of downtime for organizations.”
It did not help, O’Neill said, that the outage happened on a summer Friday, with many offices empty and IT staff who could help resolve the issue in short supply.
Software updates should be rolled out incrementally
One lesson from the global IT outage, O’Neill said, is that CrowdStrike’s update should have been rolled out incrementally.
“What CrowdStrike was doing was rolling out its updates to everyone at once. That is not the best idea. Send it to one group and test it. There are levels of quality control it should go through,” O’Neill said.
“It should have been tested in sandboxes, in many environments before it went out,” said Peter Avery, vice president of security and compliance at Visual Edge IT.
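The incremental approach O’Neill and Avery describe is often called a staged or ring-based rollout. The Python sketch below is only an illustration of that idea, not a description of how CrowdStrike or any vendor actually ships updates. The function name `staged_rollout`, the `deploy` and `is_healthy` callbacks, the host names, and the failure threshold are all hypothetical.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> list[str]:
    """Deploy ring by ring and halt after the first unhealthy ring.

    Returns every host that received the update.
    """
    updated: list[str] = []
    for ring in rings:
        failures = 0
        for host in ring:
            deploy(host)              # hypothetical: push the update to one host
            updated.append(host)
            if not is_healthy(host):  # hypothetical: post-update health check
                failures += 1
        # Stop before the next (larger) ring if this ring failed too often.
        if ring and failures / len(ring) > max_failure_rate:
            break
    return updated


# Illustrative rings: a single test machine first, then wider groups.
rings = [["canary-1"], ["office-1", "office-2"], ["fleet-1", "fleet-2", "fleet-3"]]

# Simulate a bad update: every host fails its health check.
reached = staged_rollout(rings, deploy=lambda host: None, is_healthy=lambda host: False)
print(reached)  # ['canary-1']
```

In this toy run, a faulty update reaches only the first test machine, and the remaining five hosts never receive it.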
He said more safeguards are needed to prevent a repeat of this type of failure.
“You need the right checks and balances in companies. It could have been a single person that decided to push this update, or somebody picked the wrong file to execute on,” Avery said.
The IT industry calls this a single point of failure: an error in one part of a system that creates a technical disaster across industries, functions, and interconnected communications networks, a massive domino effect.
Call to build redundancy into IT systems
Friday’s event could cause companies and individuals to heighten their level of cyber preparedness.
“The bigger picture is how fragile the world is; it’s not just a cyber or technical issue. There are a ton of different phenomena that can cause an outage, like solar flares that can take out our communications and electronics,” Avery said.
Ultimately, Friday’s meltdown wasn’t an indictment of CrowdStrike or Microsoft, but of how businesses view cybersecurity, said Javad Abed, an assistant professor of information systems at Johns Hopkins Carey Business School. “Business owners need to stop viewing cybersecurity services as merely a cost and instead as an essential investment in their company’s future,” Abed said.
Businesses should be doing this by building redundancy into their systems.
“A single point of failure shouldn’t be able to stop a business, and that is what happened,” Abed said. “You can’t rely on only one cybersecurity tool, cybersecurity 101.”
While building redundancy into enterprise systems is costly, what happened Friday is more expensive.
“I hope this is a wake-up call, and I hope it causes some changes in the mindsets of the business owners and organizations to revise their cybersecurity strategies,” Abed said.
What to do about ‘kernel-level’ code
On a macro level, it is fair to assign some systemic blame to a world of enterprise IT that often views cybersecurity, data security, and the tech supply chain as “nice-to-have things” instead of essentials, and to a general lack of cybersecurity leadership within organizations, said Nicholas Reese, a former Department of Homeland Security official and instructor at New York University’s SPS Center for Global Affairs.
On a micro level, Reese said the code that caused this disruption was kernel-level code, which affects every aspect of communication between a computer’s hardware and software. “Kernel-level code should get the highest level of scrutiny,” Reese said, with approval and implementation needing to be entirely separate processes with accountability.
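One way to picture the separation Reese describes is a release gate in which the person who approves a change can be neither its author nor the person who deploys it. The Python sketch below is a minimal illustration under that assumption. The `Release` class, the `approve` and `can_deploy` functions, and the names used are hypothetical and do not reflect any vendor's actual process.

```python
from dataclasses import dataclass, field


@dataclass
class Release:
    name: str
    author: str
    approvers: set[str] = field(default_factory=set)


def approve(release: Release, reviewer: str) -> None:
    """Record an approval; authors may not approve their own release."""
    if reviewer == release.author:
        raise PermissionError("author cannot approve their own release")
    release.approvers.add(reviewer)


def can_deploy(release: Release, deployer: str, required_approvals: int = 1) -> bool:
    """Allow deployment only with enough approvals from people
    who are neither the author nor the deployer."""
    independent = release.approvers - {release.author, deployer}
    return len(independent) >= required_approvals


release = Release(name="kernel-driver-update", author="alice")
approve(release, "bob")
print(can_deploy(release, deployer="bob"))    # False: bob approved it himself
print(can_deploy(release, deployer="carol"))  # True: approval is independent
```

Recording who approved and who deployed also leaves the audit trail that accountability requires.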
That’s a problem that will continue for the entire ecosystem, awash in third-party vendor products, all with vulnerabilities.
“How do we look across the ecosystem of third-party vendors and see where the next vulnerability will be? It is almost impossible, but we have to try,” Reese said. “It is not a maybe, but a certainty until we grapple with the number of potential vulnerabilities. We need to focus on backup and redundancy and invest in it, but businesses say they can’t afford to pay for things that might never happen. It’s a hard case to make,” he said.