CrowdStrike Outage: Lessons on Fragility and Resilience
Written by: Ferhat Dikbiyik
On Friday, July 19, 2024, a routine CrowdStrike Falcon sensor update resulted in a global IT crisis, highlighting the fragility of our interconnected systems. Described by some as “the Y2K we never got,” this incident has underscored the societal impacts of digital failures. IT professionals worldwide have been working tirelessly to mitigate the damage and restore critical infrastructure, with manual interventions required until Microsoft’s repair tool was released on Monday, July 22.
Government agencies and cybersecurity firms have been active in providing guidance and addressing related cyber risks. Meanwhile, thousands of flights were grounded, surgeries and medical procedures were delayed, and 911 services were disrupted in several regions of the US. The CrowdStrike incident demonstrates the impact digital supply chains have on our society as a whole and the need for resilience and awareness of concentration risks within these fragile systems.
What Happened?
On July 19, 2024, a routine update to CrowdStrike’s Falcon sensor, intended to enhance security, inadvertently caused widespread disruptions. The update contained a logic error in Channel File 291, which governs how Falcon evaluates named pipe execution on Windows. The error affected Windows hosts running Falcon sensor version 7.11 and above, causing system crashes (the “blue screen of death,” or BSOD) and boot loops.
The impact was immediate and far-reaching. Various sectors, including critical infrastructure like airports, hospitals, and emergency services, experienced significant operational downtime. Organizations found their Windows systems inoperable, necessitating extensive manual recovery efforts.
The societal impact was profound, with thousands of flights grounded, surgeries and medical procedures delayed, and 911 services disrupted in several regions of the US. This incident highlights how interconnected digital supply chains affect everyday life, revealing the broader societal consequences beyond IT departments. One glitch in our digital supply chain can send shockwaves through society, just like a single domino toppling an entire row.
Initial response efforts were hindered by the need for manual intervention to remove the faulty update. IT teams across the globe worked tirelessly over the weekend to restore functionality. CrowdStrike quickly identified the root cause and issued guidance on mitigating the problem. However, the scale of the issue required a more automated solution.
By Monday, July 22, Microsoft released a Windows recovery tool to assist in the automated removal of the problematic update. This tool allowed IT administrators to create bootable USB drives with a custom WinPE image to expedite the remediation process. Despite this tool’s availability, retrieving BitLocker recovery keys remained a critical step for successful system recovery.
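To make the manual step concrete: CrowdStrike’s public guidance at the time, which this post does not quote, described booting into Safe Mode or the Windows Recovery Environment and deleting files matching C-00000291*.sys from %WINDIR%\System32\drivers\CrowdStrike. The Python sketch below only illustrates that matching logic. The function names are hypothetical, it defaults to a dry run, and in practice the step was performed by hand or with a recovery script, because Python is not available in a recovery environment.

```python
from __future__ import annotations

from pathlib import Path

# Assumption: file name pattern taken from CrowdStrike's public remediation
# guidance, not from this article.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def find_channel_files(driver_dir: Path) -> list[Path]:
    """Return the files in driver_dir whose names match the Channel File 291 pattern."""
    return sorted(p for p in driver_dir.glob(CHANNEL_FILE_PATTERN) if p.is_file())


def remove_channel_files(driver_dir: Path, dry_run: bool = True) -> list[Path]:
    """List matching channel files and delete them only when dry_run is False.

    The list of matched paths is returned in both modes so an operator can
    review what would be (or was) removed.
    """
    matches = find_channel_files(driver_dir)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling remove_channel_files(Path(r"C:\Windows\System32\drivers\CrowdStrike")) with the default dry_run=True would only report the matching files. Other channel files in the same directory are left untouched because they do not match the pattern.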
Meanwhile, cybercriminals took advantage of the situation by distributing fake updates containing malware and data wipers. These phishing attempts complicated recovery efforts and underscored the importance of relying only on official communication channels. At Black Kite, we also alerted our customers to possible phishing campaigns related to this incident.
Government agencies like CISA, the UK’s NCSC, Australia’s ASD, and Canada’s CCCS issued alerts to inform organizations about the incident and its remediation. Their guidance emphasized the importance of using official updates and being cautious of phishing campaigns.
The incident, while not a cyberattack, highlighted the vulnerabilities inherent in our interconnected systems and the significant risks of relying heavily on a single vendor. It served as a stark reminder of the importance of robust backup and recovery protocols, vigilance against cyber threats, and thorough testing and validation of updates before deployment.
The Aftermath
Tireless Efforts and Manual Workarounds
The immediate aftermath saw IT professionals working around the clock to restore affected systems. Manual interventions were necessary until the release of Microsoft’s repair tool on Monday, July 22, 2024. Kudos to these tireless IT teams, who have been instrumental in getting critical infrastructure back online. Many of them had to navigate remote and distributed workforces, adding to the complexity of the recovery effort.
Government and Vendor Responses
Government agencies such as CISA, the UK’s NCSC, Australia’s ASD, and Canada’s CCCS issued alerts to guide organizations through the incident. These advisories emphasized the need to follow official remediation steps and be vigilant against phishing attacks exploiting the situation. CrowdStrike and Microsoft provided crucial updates and support, working on solutions to help organizations bulk-update their systems efficiently.
Amidst the crisis, some cybersecurity firms resorted to opportunistic marketing by setting up hotlines like XXX-NO CROWD to lure CrowdStrike customers away. Such tactics are distasteful and undermine the collaborative spirit needed to tackle cyber threats. As cybersecurity vendors, our common enemy is the threat actors, not each other. There’s a fine line between fair competition and exploiting a rival’s misfortune, and this behavior falls squarely in the latter category.
A New Dilemma for CISOs
The CrowdStrike incident has placed CISOs in a tough spot. Automatic updates, once a straightforward solution for addressing vulnerabilities, are now seen as potential risks to business continuity. The challenge is clear: how do CISOs advocate for necessary security updates while ensuring operational stability?
CISOs now need to communicate these complexities to the board effectively. Emphasize the dual need for security and stability, and explain how incidents like this underscore the importance of a balanced approach. Stress the impact of digital supply chain failures on societal functions and business operations to advocate for investments in resilience and risk management. Highlight the importance of robust communication channels with vendors, ensuring they provide timely and accurate information about updates and potential risks.
Identifying concentration risks within the supply chain is also crucial. CISOs should map out and communicate these risks, highlighting the need for diversified systems so that a single point of failure does not cause widespread societal disruption. Using tools and frameworks to identify dependencies can help anticipate and mitigate the impact of similar incidents in the future.
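As a rough illustration of what mapping concentration risk can look like, the Python sketch below takes a hypothetical inventory of business services and the vendors each one depends on, then reports the share of services exposed to each vendor. The function names, the sample inventory, and the 50% threshold are all assumptions made for this example, not part of any Black Kite product or published framework.

```python
from __future__ import annotations

from collections import Counter


def vendor_concentration(dependencies: dict[str, set[str]]) -> dict[str, float]:
    """Return, for each vendor, the fraction of services that depend on it."""
    total = len(dependencies)
    if total == 0:
        return {}
    counts = Counter(v for vendors in dependencies.values() for v in vendors)
    return {vendor: count / total for vendor, count in counts.items()}


def flag_concentration(dependencies: dict[str, set[str]], threshold: float = 0.5) -> list[str]:
    """Return vendors whose share of dependent services meets or exceeds threshold."""
    shares = vendor_concentration(dependencies)
    return sorted(v for v, share in shares.items() if share >= threshold)


# Hypothetical inventory: service name -> vendors it cannot run without.
inventory = {
    "check-in kiosks": {"edr_vendor", "cloud_a"},
    "dispatch": {"edr_vendor"},
    "patient records": {"edr_vendor", "cloud_b"},
    "payroll": {"cloud_a"},
}
print(flag_concentration(inventory))  # ['cloud_a', 'edr_vendor']
```

In this made-up inventory, three of four services depend on the same endpoint vendor (a share of 0.75), which is the kind of figure that can make a concentration risk tangible in a board discussion.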
Encourage open dialogue within the organization, fostering a culture that understands the delicate balance between security and operational continuity. By framing these discussions around real-world impacts, like the CrowdStrike outage, CISOs can build a stronger case for the necessary precautions and strategies.
Lessons on Fragility and Resilience
The CrowdStrike outage has brought terms like fragility, anti-fragility, and robustness to the forefront. This incident demonstrates that managing concentration risk is not just about business continuity, but also about societal resilience. Organizations must regularly conduct tabletop exercises to assess their resilience and prepare for potential disruptions. The interconnected nature of our systems means a single point of failure can have far-reaching consequences for society.
During this incident, Black Kite actively supported our customers by providing CrowdStrike-specific FocusTags™ to help identify and manage associated risks. We also provided guidance on using our Supply Chain Module to pinpoint vulnerabilities and mitigate potential impacts. Additionally, we alerted our customers to possible phishing campaigns related to this incident. For more insights, you can read our previous blog post, “Focus Friday: Lessons from the CrowdStrike Update Outage on Global IT Resilience.”
Looking Ahead
While this incident was not a cyberattack, it serves as a crucial reminder of the N-th party risk problem. Concentration risk around a particular vendor can lead to widespread disruptions, demonstrating the need for diversified and resilient systems. This lesson aligns with insights from our previous blog on global IT resilience, emphasizing the need for continuous vigilance and preparedness.
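The N-th party problem can be pictured as a graph: your direct vendors have vendors of their own, and an exposure several hops away can still reach you. The sketch below is a minimal breadth-first search over a hypothetical supplier graph that reports how many hops away each supplier sits. The graph contents and the function name are invented for illustration.

```python
from __future__ import annotations

from collections import deque


def reachable_suppliers(graph: dict[str, list[str]], start: str) -> dict[str, int]:
    """Breadth-first search from start; returns each reachable supplier with its depth.

    Depth 1 is a direct (third-party) vendor, depth 2 a fourth party, and so on.
    """
    depths: dict[str, int] = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        for supplier in graph.get(node, []):
            if supplier not in seen:
                seen.add(supplier)
                depths[supplier] = depth + 1
                queue.append((supplier, depth + 1))
    return depths


# Hypothetical supplier graph: organization -> the suppliers it relies on.
supply_graph = {
    "us": ["airline_it", "payroll"],
    "airline_it": ["edr_vendor"],
    "payroll": ["edr_vendor", "cloud"],
}
print(reachable_suppliers(supply_graph, "us"))
# {'airline_it': 1, 'payroll': 1, 'edr_vendor': 2, 'cloud': 2}
```

Here the organization has no direct contract with edr_vendor, yet both of its direct vendors depend on it, so an outage at that vendor would arrive through two separate paths.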
Let’s take the lessons learned from this outage as a preparatory exercise for potential cyberattacks. Effective communication, identifying concentration risks, and fostering a balanced approach to security and operational stability are key. By addressing these areas, we can enhance our resilience and readiness for future challenges. This outage, I feel, will be classified as a new type of disaster and will be part of disaster recovery exercises.
The incident has revealed how a single disruption in our digital supply chains can have widespread consequences. To mitigate such risks, it is crucial to build redundancy into our systems and ensure we have alternate paths to maintain continuity. By creating a more resilient infrastructure, we can better withstand future challenges.
For more in-depth insights and recommendations, refer to our detailed analysis in the Focus Friday blog post.