Nick Ellsmore, 22 July 2024
The CrowdStrike issue is not a case of ‘security company breaks the world’. Instead, it is a more nuanced discussion around business resilience.
At its heart, the CrowdStrike software calamity is not a cybersecurity issue at all. It is just your run-of-the-mill, happens-every-few-years, epic IT stuff-up: someone with a broad enough deployment base pushes out a dodgy update, and for a heap of businesses that are in the wrong place at the wrong time, the world ends for a few days (or possibly longer, judging by ongoing developments).
In this case, it happened to be a cybersecurity software company that pushed out the update, but if the patch had come from a printer company, we wouldn’t call this a “printer issue” (or a cybersecurity issue, for that matter); it would just be an IT outage we needed to deal with.
The fact that the whole cybersecurity apparatus and industry swung into gear, from the Australian Government’s coordination bodies to consultants and providers, is actually new and different. In previous scenarios like this, the extent of that cohort’s involvement would really just have been to say “nope, not a hack, good luck with it” as they threw it back to the IT departments to grind away.
For a long time, we have had the C-I-A triad of Confidentiality, Integrity and Availability as the three core pursuits of cybersecurity. Availability has always been the step-cousin that didn’t get nearly as much attention as the first two. We’ve seen a lot of change in that area over the last couple of years, driven by outages like Optus’s, by COVID supply chain challenges, and by regulation like SOCI and APRA CPS 230, which are focused heavily on concepts of “business resilience”.
Seen in that light, the CrowdStrike outage is another data point on the same journey. It will continue to sharpen the focus on resilience, business continuity planning and disaster recovery, across all scenarios. If “dodgy update from my security service provider” wasn’t on that scenario list before, it will be now.
What can businesses and governments do to mitigate this kind of calamity?
The non-security answer is that you run “N-1” as an update strategy; that is, you’re always one version behind so that these issues get flushed out before you update.
In the context of security, you can’t do that: an “N-1” strategy for security updates means you’re always exposed to the latest attacks, and on the balance of risk that’s unlikely to be the right decision.
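To make that trade-off concrete, here is a purely illustrative sketch of an “N-1” version selector. The version numbers and the helper function are invented for the example, not any vendor’s real update API; the point is simply that holding back one release trades freshness for the chance that earlier adopters flush out the problems.

```python
# Purely illustrative: a toy "N-1" version selector. Version numbers and the
# select_version() helper are invented, not any vendor's real update API.

def select_version(available: list[str], policy: str) -> str:
    """available is ordered oldest-to-newest.

    "n-1" holds back one release so earlier adopters flush out problems;
    "latest" takes the newest, which is what security content generally needs.
    """
    if policy == "n-1" and len(available) >= 2:
        return available[-2]
    return available[-1]

versions = ["7.14", "7.15", "7.16"]
print(select_version(versions, "n-1"))     # -> 7.15
print(select_version(versions, "latest"))  # -> 7.16
```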
This is the convergence of a range of different issues, from the duopoly of Microsoft and Apple in the desktop operating system market to the need for organisations (both security solution providers and the end-users of that software) to move faster to get security patches and updates into production to mitigate the ever-increasing risk of being hacked.
On the one hand, homogeneity and standardised platforms make support much easier and cheaper. On the other, homogeneity introduces significant concentration risk, as we saw last week.
A fully resilient organisation would have a fall-back for when any specific system is unavailable, but those fall-backs aren’t going to be “like for like”. For example, if the point-of-sale system for a retailer in a food court disappears, they can still accept cash. That’s probably sufficiently resilient given how infrequent the outage should be, and the fact they’re selling noodles at lunchtime, not running a hospital; but it obviously still has a big financial impact on the business.
Organisations need to look at their business processes (e.g. check-in, luggage handling, customer identification, customer communications, safety and security monitoring, payments) and put a workable fall-back in place wherever one is needed. Some business processes you can do without for short periods of time. Others you really can’t replace, in which case the focus needs to shift to avoiding the scenario altogether. For example, a business might keep small-scale heterogeneous implementations (e.g. a handful of Macs to complement a full Windows fleet, or a set of offline systems that don’t receive updates) that can be activated when needed.
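One lightweight way to capture that mapping is a simple resilience register, sketched below. The process names, downtime thresholds and fall-backs are invented examples, and the structure is just one possible way to record the analysis, not a prescribed method.

```python
# A hedged sketch of the mapping described above: each business process, how
# long it can tolerably run without its primary system, and the fall-back (if
# any). All names and thresholds here are invented examples.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ProcessResilience:
    process: str
    max_tolerable_downtime_hours: float
    fallback: Optional[str]  # None means no workable fall-back exists

register = [
    ProcessResilience("point of sale", 8.0, "accept cash, reconcile later"),
    ProcessResilience("customer communications", 24.0, "SMS via a secondary provider"),
    ProcessResilience("safety and security monitoring", 0.5, None),
]

for item in register:
    if item.fallback is None:
        print(f"{item.process}: no fall-back, so invest in avoiding the outage")
    else:
        print(f"{item.process}: fall back to '{item.fallback}' "
              f"for up to {item.max_tolerable_downtime_hours}h")
```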
But let’s not lose sight of the fact that the CrowdStrike outage was a freak event. Similar outages will happen again, but they will be sufficiently different that you can’t just solve for this one. The next one might be a power outage. Or a water outage that impacts cooling systems and shuts down data centres. Or another issue with BGP (Border Gateway Protocol) or DNS (Domain Name System) or some other piece of internet plumbing we know is there but largely don’t have to think about. Or, more likely, something completely unexpected.
There are things that can be done, but they all add cost for what have been, so far, very infrequent events. The question is at what point those events become frequent enough, or impactful enough, that spending money on resilience solutions you may never need is worthwhile. Each business and organisation needs to do that analysis and answer that question for itself.
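As a back-of-the-envelope illustration of that analysis (all of the numbers below are invented assumptions, not benchmarks), one simple approach is to compare the annualised expected loss from an outage against the annual cost of the resilience measure that would mitigate it:

```python
# Back-of-the-envelope resilience cost-benefit check. Every figure here is an
# invented assumption for illustration, not a benchmark.

def annualised_expected_loss(outages_per_year: float, cost_per_outage: float) -> float:
    return outages_per_year * cost_per_outage

expected_loss = annualised_expected_loss(
    outages_per_year=0.2,      # assume one major outage every five years
    cost_per_outage=500_000,   # assumed revenue and recovery impact per outage
)
mitigation_cost_per_year = 80_000  # assumed annual cost of the fall-back capability

print(f"Expected annual loss: ${expected_loss:,.0f}")
print(f"Mitigation cost:      ${mitigation_cost_per_year:,.0f}")
print("Worth it" if mitigation_cost_per_year < expected_loss else "Not worth it")
```

In practice the inputs are far less certain than a calculation like this implies, which is exactly why each organisation has to make its own call.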