David Ryan

The 33% Oversight: What Data Centers Can Learn from the World’s Largest IT Outage

July 19, 2024, was the day the airports stopped working and screens went blue. This was the day of the great CrowdStrike outage. In its immediate aftermath, the industry didn’t so much panic as lapse into its well-rehearsed crisis choreography—fingers pointed, scripts recited, blame assigned by muscle memory. The technical autopsy was a little too quick, too formulaic, to be complete.

The industry quickly came to two conclusions: The programming language was unsafe and the update that caused the outage was rolled out recklessly fast. And if we stop there, we have a very neatly packaged way of seeing the world’s largest outage: rewrite the code and ensure updates are implemented incrementally through staged rollouts. Done. We can move on to the other matters. But, if we do that, we passively participate in the industry's most comfortable ritual: oversimplifying causality.

In the world of digital infrastructure, we live and breathe root cause analysis (RCA). We instinctively look for a single point of failure, precisely because it is manageable. It’s easy to isolate. Action items designed to prevent it from happening again are clear, straight lines: delete this line of code, update that tool. For as much as it’s a solution, it’s also a security blanket we collectively wrap ourselves in so we can sleep at night, pretending that if we just find that one faulty logic gate, the machine is fixed.

But, as anyone who has ever worked on a high-speed production line knows, a bearing doesn't just seize in isolation; it’s slowly overwhelmed by the conditions of the machine it inhabits. Often, its failure begins quietly with contamination creep—fine dust and microscopic particulates that slowly find their way through its seals. Over time, the bearing might be starved of lubrication or fatigued by overload, forced to carry a load it was never rated to handle. The first signs of failure register as quiet vibrations—small, anomalous logs dismissed as background noise. Then comes the squeaking—the warnings our systems are either designed to spot or ignore. Finally, there’s the hard seize: the moment the lubrication fails, the heat spikes, and you hear the ear-bleeding hallmark of destruction—the screeching cacophony of metal on metal.

 

When World-Class Systems Stop Working

On July 19, the world’s machinery didn’t just falter; it seized like a dry, overloaded bearing, as we watched $10 billion vanish into 8.5 million unresponsive Windows machines, many displaying the hallmark blue screen of death (BSoD). In his post-incident analysis, Microsoft’s David Weston seized on the disaster to champion memory-safe languages like Rust. In older languages like C++, a program can accidentally look for data in the wrong neighborhood of memory—creating a read-out-of-bounds error—which is what caused the 8.5 million Windows machines to crash. This wasn't a new argument; it was a loud vindication of CISA’s earlier warning that the industry’s reliance on memory-unsafe code is a ticking time bomb. Big Tech never lets a good crisis go to waste, and Weston used the disaster to pivot directly to Microsoft’s commitment to rewrite the Windows kernel in Rust. The move revealed a focus that felt both undeniably correct and inward.

The most piercing critique, however, centered on the absence of staged rollouts. While outlets like The Verge initially scrambled to explain what happened, Reddit’s threads like r/sysadmin were already screaming about the why. The question reverberating across the industry was simple: Why was this update pushed to the entire fleet at once? Here, fleet refers to the massive, interconnected sea of machines running CrowdStrike’s code. The answer to this question finally arrived in CrowdStrike’s own Root Cause Analysis (RCA), where they surrendered the fatal detail. As Forbes reported, the failure wasn't simply the malformed file; it was that CrowdStrike lacked a series of staged rollouts (often referred to as a canary system) for this specific update channel. While CrowdStrike did have a process for staged rollouts, its own system logic looked at the rapid response flag—think of it as a VIP pass allowing certain files to bypass inspections—and followed its own rigid logic, building an express lane that effectively jumped over the very safety railings designed to protect the network. In other words, the internal gatekeepers and the delivery mechanism functioned exactly as designed: the system simply bypassed them under the structural assumption that 1) velocity was needed and 2) the update, being configuration data (rather than code) couldn’t cause computers to crash.
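To make that bypass concrete, here's a minimal sketch in Rust—a hypothetical model, not CrowdStrike's actual pipeline. The `Update` struct, the `rapid_response` flag, and the ring names are all invented for illustration; the flag plays the role of the VIP pass, routing an update around every canary ring.

```rust
/// Hypothetical rollout rings, smallest audience first.
const RINGS: [&str; 3] = ["canary", "early-adopter", "fleet"];

struct Update {
    /// The "VIP pass": rapid-response content skips staged rollout entirely.
    rapid_response: bool,
}

/// Returns the sequence of rings an update will pass through.
fn rollout_plan(update: &Update) -> Vec<&'static str> {
    if update.rapid_response {
        // Express lane: straight to the entire fleet, no canaries.
        vec!["fleet"]
    } else {
        RINGS.to_vec()
    }
}

fn main() {
    let routine = Update { rapid_response: false };
    let urgent = Update { rapid_response: true };
    println!("routine: {:?}", rollout_plan(&routine));
    println!("urgent:  {:?}", rollout_plan(&urgent));
}
```

Note what the sketch makes visible: the gatekeepers aren't broken, they're routed around by design, encoding the structural assumption that configuration data can't crash machines.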

When these two flaws came together, they formed a perfect storm. The operator—likely an unfortunate engineer running a routine sequence—didn't just flip a switch; they unknowingly forced the final, fatal revolution of a machine already burdened by structural neglect. The result wasn't just a quiet digital silence, but a planetary-scale cold weld, where the friction of a single piece of mismatched logic temporarily fused much of the world’s critical systems into a motionless, blue-screened heap of industrial shrapnel.

True, Rust is better material than C++. It’s like swapping a wooden beam for an I-beam. It keeps the building from rotting inside out (preventing memory corruption and null pointer dereferences). If this had been written in idiomatic Rust, the blast radius would likely have been 8.5 million warning logs and zero crashes. The virtue of Rust isn't simply its durability, but its uncompromising gatekeeping at the compiler level. By enforcing a hard-stop on memory-unsafe patterns at compile-time, Rust essentially forces the person behind the keyboard to logically reconcile their intent before the code can function.

Despite Rust’s features, critics might argue that using a Rust kernel doesn’t necessarily ensure all code is written in idiomatic Rust. Even after a kernel is rewritten in Rust, there are still ways to bypass the language's natural safety checks and design patterns. Rust does a lot, but it cannot bridge the logic gap, and an architectural mismatch could technically still cause a crash. For example, the logical mismatch that triggered the CrowdStrike outage was a content validator expecting 21 inputs while an interpreter provided only 20.
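That 21-versus-20 mismatch is worth sketching, because it shows what containment looks like in practice. The constants and field contents below are simplified stand-ins—real channel files are binary and far more complex—but the shape of the failure is the same: code reaches for a twenty-first input that was never supplied. In C++, that out-of-bounds read is undefined behavior; in idiomatic Rust, the bounds-checked `.get()` accessor turns the same mistake into a recoverable error.

```rust
/// The validator expects 21 fields per entry (a simplified stand-in).
const EXPECTED_FIELDS: usize = 21;

/// Checks that the final expected field actually exists.
/// `.get()` returns `None` past the end of the slice instead of
/// reading out of bounds.
fn validate(fields: &[&str]) -> Result<(), String> {
    match fields.get(EXPECTED_FIELDS - 1) {
        Some(_) => Ok(()),
        None => Err(format!(
            "expected {} fields, got {}",
            EXPECTED_FIELDS,
            fields.len()
        )),
    }
}

fn main() {
    // The interpreter supplied only 20 inputs.
    let supplied: Vec<&str> = (0..20).map(|_| "field").collect();
    match validate(&supplied) {
        Ok(()) => println!("validated"),
        Err(e) => println!("rejected: {e}"), // a log line, not a kernel crash
    }
}
```

The logic gap between validator and interpreter still exists in both versions; the difference is blast radius—a rejected file and a warning log instead of a blue screen.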

This highlights an important point—Rust doesn’t solve the root issue, but it does provide crucial containment. Similarly, canary rings (staged rollouts) are another form of containment. Both prevent issues from scaling to a global disaster, and yet neither fix the underlying cause, which would likely have continued to quietly exist until someone eventually tripped over it.

 

The Paradox of the Human-in-the-Loop

This brings us to an important oversight—while the lack of canary rings and the kernel language explain how the crash spread globally, they don’t explain how the human who triggered this event could possibly have foreseen the catastrophic chain reaction they were about to unleash. After all, CrowdStrike had followed its established protocol to the letter. The company didn’t crash by accident—it followed its own procedurally correct instructions, rendering the engineer who triggered the incident powerless.

In high-stakes systems, we often treat the human-in-the-loop (HITL) as a liability to be automated away, but the truth is that people are the most critical fail-safe we possess. Tragically, if design and documentation don’t account for this, the human often becomes a victim of the architecture. True HITL architecture isn't about having a person click a button—it’s about providing a person with the information, context, and agency necessary to act as a corrective force against machine logic.

Gradually, our internal systems—our deployment scripts, back-end dashboards, and rapid response loopholes—have accumulated technical debt, slowly calcifying layers of prehistoric logic together. These ancient, sedimented fossils of “fixed-for-now” have been paved over so many times they’ve formed into subterranean mountain ranges, a landscape of jagged, invisible outcrops and hidden fault lines, just waiting for some poor, over-caffeinated soul to snag a boot and trigger a tectonic shift that folds the very floor of the on-call response center into the abyss.

While the CrowdStrike outage was a software seizure, a nervous system failure in the cloud, it’s important to remember what the cloud actually is—a collection of windowless brick buildings full of humming fans and miles of hair-thin, laser-pulsed glass. These buildings—the data centers—do the heavy lifting of the digital age, and they are becoming increasingly complicated.

The Uptime Institute’s data suggests that while we have nearly perfected the hard infrastructure of N+2 redundancy in our data centers, nearly 40% of organizations have suffered a major outage linked to human error in the last three years. While the gross frequency of outages has trended down, the statistical likelihood of an event being triggered by a human has gradually trended up. This isn’t an isolated observation; The Register recently noted that while outages are less frequent, between two-thirds and four-fifths of major wobbles involve “some element of meatbag-related cause.” Both sources echo the industry’s general diagnosis—blame rapid growth, regional staff shortages, and a simple failure of humans to follow documentation. Additionally, they suggest the best cure is providing real-time ops support and increased staff training.

Yes, training and support help. Yet, this is where the industry's language and prognosis simultaneously illuminate and obscure. While human error provides an important way to categorize causality, such terms sometimes conceal the deeper ways humans and machines interact. When our vocabulary reduces complex failures to simple labels, it doesn't just describe the problem—it shapes how we perceive it, often limiting our ability to see the complete picture. When we say an outage was caused by human error, we unconsciously absolve the architecture, forgetting that amateurs aren’t operating these critical, high-pressure environments—elite professionals are.

 

The 33% Oversight

When a single hand accidentally turns off the world’s digital infrastructure, it can surely be due to a lack of training or competence, but just as likely it’s because a system slowly, over the course of years, has embedded traps in itself, layered together like digital Tiramisu—each layer of legacy code soaked in the bitter espresso of "quick fixes," until, eventually, the whole structure becomes too heavy and collapses into a soggy, structural mess—one that usually resists the neat, taxonomic classification so often found in post-incident reports.

The point is that while failures often have linear causes, they usually occur in messy, interconnected systems. When we stop to acknowledge that, our statistics begin to tell a more nuanced story. According to Uptime Institute’s 2025 Annual Outage Analysis, between 66% and 80% of all major data center outages involve a human—or, as The Register more bluntly characterizes it, some amount of a “meatbag-related cause.” Let’s take the middle of that range—say 73%—as a rough average. The Uptime Institute also reports that of these incidents that involve a human, 45% are explicitly attributed to faulty procedures or processes.

When you do the math, multiplying the human factor (73%) by the process-failure rate (45%)—the resulting figure indicates that about 33% (about 32.85% to be exact) of all major data center outages globally (about one out of every three) are the result of a human involved in a faulty process or procedure that was authorized and deemed to be correct at the time of the outage. The CrowdStrike outage is eerily similar in this regard. The engineer wasn't a rogue agent nor an amateur cutting corners; they were an elite professional who simply had the misfortune of being in the wrong place at the wrong time.
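The back-of-the-envelope arithmetic is easy to verify, using the midpoint value chosen above:

```rust
fn main() {
    let human_factor = 0.73;    // midpoint of the 66–80% human-involvement range
    let process_failure = 0.45; // share of those incidents tied to faulty process
    let oversight = human_factor * process_failure;
    println!("{:.2}% of all major outages", oversight * 100.0); // prints 32.85%
}
```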

Recently, I have come to understand this as the 33% Oversight—this is the percentage of outages where human understanding and machine execution diverged. It’s the space where we inadvertently engineered the human's ability to reliably be a fail-safe out of the loop. It’s where internal tools aren’t designed with the operator's cognitive load in mind, deployment interfaces lack the critical insight humans need to recognize a looming systemic failure before it happens, and system architecture fragments across a dozen different sources.

The 33% Oversight is less of an absolute constant, and more of a starting point. After all, this percentage depends heavily on a variety of factors (like design and system complexity) that are constantly in flux. And conservatively, 33% is likely just the floor. It doesn't account for the cases where processes and procedures were correct but the human who failed to follow them was suffering from cognitive overload or navigating a system so fragile that a single typo became a global kill-switch. It also ignores many of the cases where the "meatbag" was simply drowned by a poorly designed interface or a cacophony of non-actionable signals.

When the industry says the rise in these errors is a consequence of rapid growth, I suspect what is really happening is that digital infrastructure has become so sprawling and complex that even our most specialized people are having trouble navigating it. And rather than focusing on how to make that machine more intuitive for the humans operating it, we instead obsess over the resilience of the machine—the power systems, environmental controls, and the automated failovers it uses—while ignoring the logic, clarity, and honesty of its many interfaces.

At some point, we need to learn that where there is human error there’s often architectural incoherence. I write this at a time when the digital age is becoming so vast that no single human can holistically comprehend it. Our infrastructure isn’t currently designed to respond to this problem—a problem that resembles E.M. Forster’s prophetic vision in The Machine Stops, where a society forgets how their world-sustaining Machine works and begins to treat its maintenance as a holy ritual rather than a technical task. As we increasingly trust mechanisms that hum tunes we no longer recognize, we stop being able to holistically understand what we’re doing, and maintenance too inevitably begins to resemble ritual.

Modern architectures pile software-defined networks, globally distributed data fabrics, and complex orchestration layers into a towering architecture of abstraction. Within this framework, digital infrastructure manages millisecond-latency demands through a fractured set of strategies. While some architectures rely on the raw performance of bare-metal clusters and edge computing nodes to shave off every microsecond, others leverage cloud platforms that spawn lattices of microservices and serverless functions—automated virtual instances and containers that often spin up and tear down in seconds. This volatility compounds across the hundreds of discrete services maintained by major cloud providers, effectively creating a world of systems within systems within systems. Now, there’s a relentless push for AI integration that adds a layer of non-deterministic chaos to a stack with unknown weak points—fragility that more or less resembles Schrödinger’s cat—stable until the moment something reveals how, under just the right conditions, it isn’t. Much of this is being housed in the sprawling digital estate of massive, windowless data centers—those silent, humming cathedrals of silicon—all interconnected by a global web of things.

Whether we effectively manage these massive server farms or simply find ourselves inhabiting them increasingly hinges on how they are designed for human interaction.

Considering this, good design can no longer be a gift reserved for the end user. Now more than ever, internal users (DevOps, NetOps, rack-and-stack techs, critical ops, data center ops, facility managers, SecOps, SREs, sysadmins, and platform architects, to name a few) must be truly accounted for in the design and documentation process. If, for example, the internal tools a sysadmin uses to manage a sprawl of disparate services are unintuitive, outdated, or lacking in basic guardrails, then the disaster is already embedded in the code’s inherent structure, until one day we find ourselves no longer managing a network, but attempting to read tea leaves in a storm we helped create.

As James Reason and Charles Perrow have shown, catastrophic failures rarely stem from a single error but rather occur when multiple weaknesses align—like holes in Swiss cheese slices lining up perfectly. Each layer of defense has its imperfections, but normally these gaps sit at different positions, preventing disaster. It's only when organizational oversights, technical flaws, procedural gaps, and human limitations all momentarily align that the arrow of failure finds its path through the entire system.

 

The Architecture of Complacency

As I find myself deep-diving recent outages, I am increasingly seeing that 33% Oversight—that architectural delta between human understanding and machine execution—as a familiar metapattern. Although it usually isn’t one of the primary actors, it is an elusive and recurring character in the drama of downtime. Ubiquitous and often hidden in plain sight, it’s surprisingly protected by the industry’s most pervasive shibboleths.

Take Amazon CTO Werner Vogels’ famous axiom: "Everything fails, all the time." It is a brilliant call to resiliency—a battle cry for building systems that can survive the inevitable. But over time, the industry has twisted this wisdom into a kind of complacency, treating failure as a law of physics rather than a design challenge, which helps explain why the overall occurrence of data center outages has decreased while the percentage of outages that involve humans has increased. It seems that by accepting failure as a universal constant, we inadvertently glossed over the entropy that builds up in our own internal tools, processes, and documentation.

Then there is the mythology of scale. When Amazon CEO Andy Jassy talks about the sheer, incomprehensible scale of global cloud infrastructure, reminding us that "People don’t understand how incredibly large the challenge is," he’s not wrong, but this sentiment has an unspoken corollary: it implies that the machine is now so vast it is effectively unknowable. It frames complexity as a force of nature, suggesting that only an inner circle can possibly navigate it, and even then, only by following the right ritual.

When we treat our infrastructure with the same superstitious awe our ancestors held for the harvest, we stop focusing on the tools we actually use to maintain it. Interfaces and documentation begin to feel incidental—tools that are more or less expected to fail in the face of some looming, universal constant. Sometimes, that means we stop questioning legacy decisions. Why, for example, was CrowdStrike still shipping a critical kernel driver written in C++ in 2024? I can only imagine that at a certain scale the cost of change is believed to be greater than the cost of a managed risk.

This reveals the strange and intoxicating intersection of complacency and complexity. While complacency convinces us that our tools are good enough, complexity ensures we'll never truly know they aren’t until they fail. Yet, like the subtle vibrations of a bearing before catastrophic failure, warning signs are often there, if only we take the time to listen to them. Experts had been sounding the alarm on memory-unsafe drivers for years, yet the industry chose to focus on building bigger shock absorbers (resiliency) instead of fixing the broken axle (design). We optimized for the "everything fails" mantra, assuming our automated failovers would catch us, but what do you do when those failovers are all built on the same calcified logic as the systems they were meant to save? When you look at the economics, the managed risk of C++ looks like a historic blunder: the July 19 outage cost the Fortune 500 alone an estimated $5.4 billion. Compare that staggering vaporization of value to the investment required to move core drivers to Rust or to implement a rigorous schema-validation handshake years ago. We spent billions to survive a crash that we could have spent millions to prevent, proving that the true cost of technical debt isn't paid in interest—it's paid in catastrophe.

 

The Five Pillars of High-Stakes Information Design

The CrowdStrike incident teaches us a crucial truth: humans aren't a liability to be removed from the equation. When a system inevitably fails, when all the holes in the Swiss cheese line up perfectly for a disastrous moment, humans become our strongest safeguard. Humans aren’t a risk to be engineered out. Rather, they are the ultimate semantic validator. For this reason, closing the 33% Oversight requires an intentional commitment to making the machine's reality descriptively accessible to the humans who operate it. Bridging this gap requires a foundation built on five core pillars of high-stakes information design.

  1. Mandate accuracy as the non-negotiable baseline: Accuracy determines whether our tools and documentation provide a clear window to reality. If one percent of critical data is wrong or missing, the entire system is at risk. Establishing a definitive, verified source of truth is the first step in all design and documentation lifecycles. 

  2. Optimize information density and signal-to-noise filtering: When a system is failing, an engineer doesn't need a 50-page manual; they need a single, granular data point that tells them which switch to flip, what will happen when they flip the switch, and what they must do next. Information density and signal-to-noise filtering means putting the most critical information exactly where it needs to be at the right time. Even perfectly accurate, relevant data becomes functionally useless if it is buried under a mile of fluff rather than being delivered at the right time and with the detail needed to make a split-second decision in the middle of an emergency. 

  3. Enforce intuitive design and information architecture: We must build interfaces and knowledge structures that are not only accurate but also mirror the mental models of the people using them. If the navigation of a dashboard or the hierarchy of a document doesn't align with how users actually think about the subject matter, then these tools are effectively designed to get users lost. Additionally, proximity makes meaning in design and documentation. Perfect information presented through confusing non-intuitive architecture, while technically right, often becomes effectively wrong when operators can’t find what they need or AI agents hallucinate and provide inaccurate information. 

  4. Hardwire documentation into the engineering lifecycle: Documentation is sometimes overlooked and treated as secondary, and yet it is absolutely crucial to operations. Effective documentation starts with identifying its audience (personas) and their specific use cases. It provides the context, instructions, and mental models needed for data center personnel to navigate their complex world. Standardizing documentation forces personnel to identify and develop best practices—a critical step toward streamlining and automating internal tooling. Additionally, effective documentation integrates content reuse, document management, style guides, and information architecture, making content easy to find, useful, and manageable. When documentation is engineered this way, it becomes a force multiplier for uptime and safety. When it’s ignored, you’re left with a 50,000-document repo—a Gordian knot of disconnected, disorganized documents, many offering conflicting guidance on the same topics and riddled with obsolete data—that will inevitably leave your team reactive, blind, and prone to the very “human errors” the system was designed to prevent.

  5. Implement and trend feedback loops: We need to develop feedback loops that continuously listen to the people who actually use our systems. When engineers solve a problem during an outage or during daily rounds, their findings need to be fed back into the system itself. This doesn’t merely include record-keeping. It includes integrating data from multiple feedback channels, identifying common trends, analyzing those trends, and creating strategic solutions that holistically address audience needs and mitigate issues before they become outages and failures. Such data-driven approaches provide a continuous cycle for us to extract and codify the hard-won knowledge of subject matter experts (SMEs) and frontline users, transforming their insights into the structural guardrails of the next generation.

 

Closing the Gap: From Oversight to Insight

This brings me to my final point. The 33% Oversight is not an inevitability nor an immutable constant—it’s a symptom of architectural failure. By balancing our focus on the resilience of the machine with the intuitiveness of its design, the descriptive clarity of its interfaces, and the usefulness of its documentation, we can bridge the gap between what the human sees and what the system does. Doing this not only prevents future downtime, it gives us a highly detailed mental model of our infrastructure that unlocks earlier issue detection, faster response time, and the agility to scale without increasing technical debt. All of this will be increasingly important as our digital infrastructure continues to expand and become more complex.

 
