The CrowdStrike Outage and the Fragility of Modern Software

On July 19, 2024, CrowdStrike pushed a content configuration update to its Falcon endpoint protection software. The update contained a defect that caused affected Windows machines to crash with the blue screen of death and enter a reboot loop. Microsoft estimated that roughly 8.5 million Windows devices were affected worldwide. Airlines could not check in passengers. Hospitals lost access to patient management systems. Bank transactions were delayed. Emergency services in some jurisdictions were disrupted.

The outage was not a security event. It was a software quality event with security infrastructure as the contagion vector. Falcon was deployed as a kernel-level component on Windows machines, which meant that defects in Falcon could affect Windows boot reliability in ways that ordinary application defects could not. The deployment model, where updates flowed automatically to all Falcon installations from a central source, meant that one bad update could affect every customer simultaneously.

The recovery was painful. Affected machines required manual intervention in many cases, with administrators booting in safe mode, deleting the offending update file, and restarting. For organisations with thousands of affected machines, the recovery work consumed significant time. Some organisations reported being unable to fully restore normal operations for days.

What the incident revealed was a category of risk that had been increasing in modern software supply chains for years. The pursuit of speed in delivering updates, particularly for security-related software where fast updates are valuable, had produced architectures that did not have the safeguards normally associated with critical infrastructure. The same automatic update mechanism that allowed CrowdStrike to push security fixes to its customers within hours also allowed it to push a defect to all those customers within hours.

The post-incident analysis that CrowdStrike published described a chain of failures in the testing and deployment process. The defective update had passed automated validation because of a bug in the validator, and it had not been adequately tested in the configurations where it would actually run. This type of content update was also not subject to a staged rollout that could have caught the problem before it reached most customers. Combined with the kernel-level deployment of the affected component, those failures produced the global impact.
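
To make the idea of a staged rollout concrete, here is a minimal sketch in Python. It is not CrowdStrike's deployment system. The names `staged_rollout`, `deploy_and_check`, and `RolloutResult`, the stage fractions, and the failure threshold are all hypothetical, chosen only to illustrate how a gate between stages limits how far a bad update can travel.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    completed: bool      # True if every stage passed its health gate
    hosts_reached: int   # number of hosts that received the update


def staged_rollout(
    hosts: Sequence[str],
    stages: Sequence[float],
    deploy_and_check: Callable[[str], bool],
    max_failure_rate: float = 0.01,
) -> RolloutResult:
    """Deploy to growing fractions of the fleet, halting on unhealthy stages.

    `stages` holds cumulative fractions of the fleet (e.g. 0.01, 0.1, 1.0).
    `deploy_and_check` is a hypothetical callback that applies the update to
    one host and returns True if the host is healthy afterwards.
    """
    reached = 0
    for fraction in stages:
        # Cumulative number of hosts that should have the update after this stage.
        target = min(len(hosts), round(len(hosts) * fraction))
        batch = hosts[reached:target]
        if not batch:
            continue
        failures = sum(1 for host in batch if not deploy_and_check(host))
        reached = target
        # Gate: stop before the next, larger stage if too many hosts failed.
        if failures / len(batch) > max_failure_rate:
            return RolloutResult(completed=False, hosts_reached=reached)
    return RolloutResult(completed=True, hosts_reached=reached)
```

With a fleet of 1,000 hosts and stages of 1%, 10%, 50%, and 100%, an update that crashes every machine would stop after the first 10 hosts in this sketch instead of reaching all 1,000. The trade-off is latency: each gate delays delivery of legitimate fixes, which is the tension between speed and safety that the incident exposed.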

The broader lesson, which the industry had encountered in different forms before, was that software with immediate global reach requires correspondingly rigorous quality discipline. The CrowdStrike outage was a clear reminder that this discipline was not always present, even in software whose primary job was protecting other software.
