On July 19, 2024, what should have been a routine update meant to improve CrowdStrike’s Falcon Sensor software ended up causing chaos. Instead of enhancing the endpoint detection and response system, the update resulted in Windows computers crashing spectacularly, displaying the dreaded “Blue Screen of Death.” This caused disruptions across the globe and across industries such as healthcare, aviation, banking and media.
CrowdStrike quickly stopped the faulty update distribution and provided a workaround that involved manually rebooting machines in safe mode and deleting a specific file. While this approach worked for some, it was less effective for large organizations with hundreds or thousands of affected computers. Eventually, CrowdStrike released a software patch to remove the faulty code, but it only worked for machines that could connect to the internet.
Impact Across Sectors
Some service interruptions remain unresolved 24 hours later. My brother-in-law’s flights were canceled today. It’s still a mess. The global outage had far-reaching consequences in several sectors:
Healthcare: Imagine being in the middle of a critical surgery and the operating room computer system crashes. Hospitals in Austria, Germany, and Israel canceled surgeries, postponed appointments, and had significant difficulties accessing patient records.
Transportation: Airlines in Japan, Germany, and the United States canceled flights and experienced long delays, with airports in some countries resorting to manual check-in processes and issuing handwritten boarding passes. Yes, you read that right – handwritten boarding passes. There were, and still are, some extremely frustrated travelers.
Finance: Banks in India and Australia experienced disruption in services, making it impossible for customers to access banking applications and conduct credit card transactions. The UK stock exchange also faced issues, causing financial markets to struggle.
Retail: Stores in multiple countries had issues with payment processing, resulting in lengthy lines at checkout counters and unhappy customers – a retailer’s worst nightmare.
According to FedScoop, several government departments faced challenges because of the disruption:
The Social Security Administration (SSA) faced significant issues, resulting in the closure of all physical locations and prolonged wait times for callers to the national 800 number.
The Department of Homeland Security (DHS), in collaboration with the Cybersecurity and Infrastructure Security Agency (CISA), worked to mitigate these system outages.
The Department of Veterans Affairs (VA) experienced downtime with their Enterprise Service Desk, though healthcare services and veteran care remained unaffected.
The Department of Justice (DOJ) was also impacted by the outage and actively worked on troubleshooting and finding workarounds.
The Department of Education acknowledged phone difficulties.
Various government agencies, including the DOE, NNSA, DOJ, Department of Education, FAA, NRC, Treasury Department, State Department, Administrative Office of the U.S. Courts, and HHS, collaborated with CrowdStrike, Microsoft, and other partners to address the system outages.
Single Points of Failure
A Single Point of Failure (SPOF) is a part of a system whose failure could bring down the entire system. The CrowdStrike issue is an example of how a single faulty update can lead to widespread disruptions. SPOFs can include hardware elements, software applications, network connections, and even human processes. SPOFs pose major risks in systems that demand high levels of availability, such as those found in healthcare, finance and critical infrastructure.
Spotting Single Points of Failure
It’s important to identify single points of failure (SPOFs) because even a system that seems free of them can still encounter issues due to hidden dependencies or unexpected events.
Here are 2 ways to spot these vulnerabilities:
1. Systems Analysis
This method involves analyzing the entire system workflow to pinpoint weaknesses and bottlenecks that could result in system failures.
- Mapping Dependencies: Create a diagram illustrating all dependencies within your system architecture – encompassing hardware, software and third-party services. Identify components whose failure would have repercussions.
- Critical Path Identification: Pinpoint the critical paths in your workflows. These are the sequences of tasks or components that are essential for system operation. If any part of this process fails it could lead to a disruption in operations.
2. Failure Mode and Effects Analysis (FMEA)
Identify potential failures by examining each component and determining the likelihood and consequences of its failure. Then prioritize the risks by their impact on the system and address the components with the highest risk scores first.
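The dependency mapping step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the service names, component names, and the `find_spofs` helper are all hypothetical, and each service is modeled as a list of requirement groups where any one member of a group is enough to satisfy it. A component that is the only member of a group has no alternative, so it is flagged as a SPOF. The sketch looks at direct dependencies only and does not follow transitive ones.

```python
from collections import defaultdict

# Hypothetical inventory. Each service lists its requirement groups;
# any ONE member of a group is enough to satisfy that requirement.
DEPENDENCIES = {
    "checkin-kiosk": [["endpoint-agent"], ["db-primary", "db-replica"], ["isp-a", "isp-b"]],
    "payments": [["endpoint-agent"], ["card-processor"], ["isp-a", "isp-b"]],
}


def find_spofs(dependencies):
    """Return {component: services that fail if that component fails}."""
    spofs = defaultdict(set)
    for service, groups in dependencies.items():
        for group in groups:
            alternatives = set(group)
            # A group with a single member has no fallback: that member is a SPOF.
            if len(alternatives) == 1:
                spofs[next(iter(alternatives))].add(service)
    return {component: sorted(services) for component, services in spofs.items()}


spofs = find_spofs(DEPENDENCIES)
for component, services in sorted(spofs.items()):
    print(f"{component}: affects {', '.join(services)}")
```

In this made-up inventory the shared endpoint agent shows up under both services, which is the kind of common dependency the diagram is meant to expose.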
What the heck is a Failure Mode and Effects Analysis (FMEA)?
Not gonna lie, I’ve never heard it called this. It is a process that helps identify failure points in a system and assess their impact. I’m most familiar with it being the Risk Assessment that is a requirement in NIST SP 800-171.
To conduct an FMEA (aka Risk Assessment):
Start by listing all the ways a specific component in a system could malfunction. This involves brainstorming breakdown scenarios, such as losing a network connection, an incorrect operation or configuration by personnel, database corruption, a tsunami wiping out your data center, or hardware failure.
Evaluate Impact of Failures
For each identified failure mode, describe its potential repercussions. This requires understanding how the failure could affect components of the system and its overall functioning. Effects could be anything from loss of network connectivity, loss of access to applications and data, system downtime, loss of critical business data, and disrupted services to security breaches.
Determine Risk Levels
Evaluate the seriousness of each failure mode based on its consequences. This usually involves assigning risk levels or scores to each failure by considering factors like likelihood of occurrence and severity of impact.
1. Likelihood of Occurrence: How often the failure might occur based on historical data, current conditions, and industry benchmarks.
• Low: Rare occurrences
• Medium: Occasional occurrences
• High: Frequent occurrences
2. Severity of Impact: Evaluate the outcomes if the failure takes place, considering factors such as downtime, data loss, financial implications and harm to reputation.
• Low: Minor disruptions with negligible impact
• Medium: Moderate disruptions with manageable impact
• High: Severe disruptions with significant impact
3. Determine Risk Level: Combine the likelihood and severity to determine the overall risk level. This helps with prioritizing risk mitigation actions and using resources efficiently.
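One common way to combine the two ratings is to multiply them. The sketch below assumes a 1 to 3 scale for Low, Medium and High and uses cut-off values that are my own choice, not part of any standard. The likelihood and severity ratings assigned to each failure mode are made up for illustration; a real assessment would use your own data.

```python
# Assumed numeric scale for the Low / Medium / High ratings.
LEVELS = {"low": 1, "medium": 2, "high": 3}


def risk_level(likelihood, severity):
    """Return (score, label) where score = likelihood x severity."""
    score = LEVELS[likelihood] * LEVELS[severity]
    # Hypothetical thresholds: 6 or 9 is high, 3 or 4 is medium, 1 or 2 is low.
    if score >= 6:
        label = "high"
    elif score >= 3:
        label = "medium"
    else:
        label = "low"
    return score, label


# Failure modes from the brainstorming step, with illustrative ratings.
failure_modes = [
    ("Lost network connection", "medium", "medium"),
    ("Misconfiguration by personnel", "high", "medium"),
    ("Database corruption", "low", "high"),
    ("Tsunami wipes out data center", "low", "high"),
    ("Hardware failure", "medium", "high"),
]

# Build a simple risk register, highest score first.
register = sorted(
    ((name, *risk_level(likelihood, severity)) for name, likelihood, severity in failure_modes),
    key=lambda row: row[1],
    reverse=True,
)

for name, score, label in register:
    print(f"{score}  {label:<6}  {name}")
```

Sorting by score gives a rough order for working through mitigations, which is the whole point of scoring in the first place.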
Propose Mitigation Plans
Devise strategies to minimize the risk of each failure mode happening. This might include developing comprehensive testing protocols, introducing redundancies, establishing recovery protocols, training staff, or improving your incident response plans and regularly testing them to ensure quick and effective resolution of any issues that arise.
Lessons Learned
Avoid Single Points of Failure
Organizations should proactively identify and mitigate single points of failure (SPOFs) in their IT systems. This involves more than just adding redundancies; it requires a comprehensive analysis of both internal and external system dependencies. Critical components and dependencies should have backups through redundancy and failover mechanisms. Additionally, systems should be designed with reliable components and robust error handling. Establishing and regularly testing incident response plans and recovery procedures is essential. Setting up geographically distributed environments can reduce the risk of regional disruptions. Proactive monitoring systems should also be implemented to detect and address issues before they escalate.
Strengthen Software Update Protocols
Implement rigorous testing and deployment procedures for software updates. Testing, phased rollouts, rollback mechanisms, and continuous monitoring for anomalies are vital to prevent issues or to deal with them promptly. The outage underscored the need for comprehensive and rigorous testing of software updates, particularly those intended for critical systems. Even small errors can trigger cascading consequences in today’s connected world. Effective quality assurance processes, including extensive pre-release testing in controlled environments, are crucial for reducing risks.
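A phased rollout with a rollback mechanism can be sketched as follows. This is a minimal simulation, not a description of how CrowdStrike or any other vendor deploys updates. The `phased_rollout` function, the ring names, the 5% failure threshold, and the in-memory `fleet` dictionary are all assumptions made for illustration. The idea is to update a small ring first, check health, and only then move on to larger rings.

```python
def phased_rollout(rings, deploy, is_healthy, rollback, max_failure_rate=0.05):
    """Deploy ring by ring; halt and roll back if a ring looks unhealthy.

    rings: ordered list of (ring_name, hosts), smallest ring first.
    deploy, is_healthy, rollback: caller-supplied callables taking a host name.
    """
    updated = []
    for ring_name, hosts in rings:
        for host in hosts:
            deploy(host)
            updated.append(host)
        failures = sum(1 for host in hosts if not is_healthy(host))
        if hosts and failures / len(hosts) > max_failure_rate:
            # Undo every host touched so far, most recent first.
            for host in reversed(updated):
                rollback(host)
            return {"status": "rolled_back", "failed_ring": ring_name, "hosts_touched": len(updated)}
    return {"status": "completed", "hosts_touched": len(updated)}


# In-memory stand-in for a fleet of ten machines (hypothetical).
fleet = {f"host-{i}": "v1" for i in range(10)}
rings = [
    ("canary", ["host-0", "host-1"]),
    ("early", ["host-2", "host-3", "host-4"]),
    ("broad", [f"host-{i}" for i in range(5, 10)]),
]


def deploy(host):
    fleet[host] = "v2-bad"  # simulate a faulty update


def rollback(host):
    fleet[host] = "v1"


def is_healthy(host):
    return fleet[host] != "v2-bad"


result = phased_rollout(rings, deploy, is_healthy, rollback)
print(result)
```

With the simulated bad update, only the two canary hosts are ever touched. Keep in mind that a remote rollback like this assumes the machine is still reachable, which was not the case for many of the crashed Windows computers in this incident.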
Diversify Cybersecurity Solutions
Steer clear of relying heavily on single vendors or platforms by diversifying your cybersecurity solutions. Using multiple security vendors for different protection layers (defense in depth) and ensuring backup systems are in place can enhance resilience against primary vendor failures. When investing – don’t put all your money in one stock!
The incident exposed the risks of consolidating cybersecurity needs onto a single platform. While this strategy may offer cost savings and operational efficiency, it also creates a single point of failure. Organizations need to weigh the benefits of centralization against the importance of resilience and redundancy in their cybersecurity infrastructure. Think of it like putting all your eggs in one basket – if that basket breaks, you’re cleaning up a huge mess.
If you have a security system with one alarm, what happens when it fails? Organizations should rethink centralized security models. Putting trust in a single entity for cybersecurity functions can create vulnerabilities that adversaries may exploit. Organizations, especially those responsible for critical infrastructure, should investigate decentralized and resilient security models to reduce dependence on single vendors.
Transparency and Communication
Clear, transparent, and timely communication during incidents is crucial. CrowdStrike’s initial response, though acknowledging the issue, lacked a clear apology and detailed explanation, impacting customer trust. Providers should adopt a proactive and empathetic communication strategy, keeping clients informed about the root cause, remediation efforts, and preventive measures. This can calm nerves a bit when things aren’t going so great.
The Importance of Risk Assessments
By conducting risk assessments and taking a proactive approach to mitigating single points of failure, organizations can enhance their systems’ resilience, reduce downtime, and ensure smooth business operations.
More Strategies to Reduce Risks
Monitoring and Detection Capabilities
Deploy tools for monitoring and detecting issues quickly by identifying anomalies and potential security threats. Keeping an eye on system logs, network traffic and user activities is essential for catching incidents early and containing them.
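As a rough idea of what anomaly detection can look like, the sketch below flags a time interval when its event count is far from the recent average. The `find_anomalies` function, the window size, the threshold of three standard deviations, and the sample crash counts are all assumptions for illustration. Real monitoring tools use far more sophisticated methods.

```python
from statistics import mean, stdev


def find_anomalies(counts, window=5, threshold=3.0):
    """Return indexes whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: any change at all is unusual.
            is_anomaly = counts[i] != mu
        else:
            is_anomaly = abs(counts[i] - mu) / sigma > threshold
        if is_anomaly:
            anomalies.append(i)
    return anomalies


# Hypothetical crash reports per five-minute interval.
crash_counts = [2, 3, 2, 4, 3, 2, 3, 40, 3]
print(find_anomalies(crash_counts))
```

A sudden jump like the 40 in this sample is the kind of signal that should page someone, or pause a rollout, before the problem spreads.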
Culture of Security Awareness
Teach your kids not to talk to strangers. Foster a culture of cybersecurity awareness and responsibility among employees. Ongoing user education and awareness programs can significantly reduce risks and improve the overall security posture.
Incident Response, Business Continuity and Recovery Plans
Organizations must develop and regularly test comprehensive incident response and business continuity plans. Tests should simulate potential disruptions in critical IT services, including those provided by third-party vendors. Contingency plans for alternative solutions or workarounds are essential for operational resilience. You hope you never need the spare tire in your vehicle, but you’re glad it’s there when you do.