April 7, 2026 · 6 min

You Can't Outsource Understanding

Patrick McClory

You can delegate the work. You can use managed services. You can hire people who know the thing you don't. What you can't do is outsource the comprehension. When something breaks at 2am, the understanding either exists or it doesn't.

platform-engineering operations infrastructure leadership

You Can’t Outsource Understanding

You can delegate work. You can use managed services and abstractions and platforms built by people who know things you don’t. You can hire engineers who have deeper expertise in specific domains than you do. All of that is legitimate and often the right call.

What you can’t do is outsource the comprehension.

The work gets done either way. The system runs, the deploys happen, the infrastructure serves its purpose. But the understanding of how it works, what it depends on, what happens when it doesn’t. That either lives in the team or it doesn’t. And the moment it doesn’t is usually the moment you need it most.

The Mount Timing Problem

Early in my Kubernetes work, early enough that the documentation was thin and the behavior wasn’t always consistent with what the spec described, I ran into a problem with persistent volume mounts and AWS EBS.

The Kubernetes storage layer at the time had behavior that didn’t map cleanly to how EBS volumes actually attached and detached. There was no positive verify step in the mount workflow: no confirmation that the volume had actually mounted successfully before the workload tried to use it. We assumed it worked because it usually did. That assumption was load-bearing.

When it broke, the debugging path went all the way down. Storage APIs. The Kubernetes storage subsystem. EBS attach and detach behavior. The gap between what the spec said should happen and what was actually happening in practice. It was deep, slow, and expensive in engineering time.

The answer, after all of that, was a mount timing issue. The volume hadn’t finished mounting before the workload tried to use it. A verify step, a simple confirmation that the mount had completed successfully before proceeding, would have surfaced this in thirty seconds.

The depth of the debugging was proportional to the gap in the understanding, not to the complexity of the actual problem. The problem was simple. The understanding gap was large. Those two things together produced an expensive debugging session that a positive verify step would have prevented entirely.

How the Gap Forms

The gap in this case wasn’t laziness or carelessness. It was the normal state of working with something new in a space where the abstractions were still settling and the happy path was reliable enough that the failure modes weren’t visible yet.

Early Kubernetes was like this. The behavior wasn’t always what you’d expect from the documentation. The AWS EBS integration had specific assumptions baked in that weren’t obvious unless you’d hit the edge cases. The reasonable assumption, that volume mounting worked the way it looked like it worked, was correct often enough that nobody had reason to add the verify step.

That’s how outsourced understanding forms. Not through negligence but through the reasonable extrapolation from incomplete information. You assume it works like you think it works because it usually does. You build on that assumption. The assumption becomes load-bearing without anyone deciding it should be.

The gap isn’t obvious until it is. The system holds until it doesn’t. And when it doesn’t, the path back to understanding runs through every layer of the abstraction you trusted without fully understanding.

The Abstraction Tax

Every abstraction you use carries what I think of as an abstraction tax: the cost of not understanding what’s happening below the layer you’re working at.

Most of the time the tax is zero. The abstraction holds, the behavior is predictable, and the fact that you don’t know exactly how it works doesn’t matter. This is the whole point of abstractions: they let you work at a higher level without carrying the complexity of every layer below you.

But the tax becomes real the moment the abstraction breaks or behaves unexpectedly. At that point, your ability to diagnose and recover is bounded by the depth of your understanding. If the abstraction is your floor, if you have no model of what’s happening below it, then a problem below the abstraction is effectively invisible to you. You can observe its effects. You can’t reason about its cause.

This is what happened with the volume mount problem. The Kubernetes storage abstraction was the working layer. When the behavior didn’t match expectations, the debugging required going below that layer: into the storage APIs, the mount subsystem, the EBS attach behavior. That depth of understanding wasn’t there yet because it hadn’t been needed yet. The abstraction tax came due all at once.

The tax doesn’t go away by choosing better abstractions or more reliable managed services. It goes down, sometimes significantly. The failure modes are less frequent. The behavior is more consistent. But the tax is still there, and when it comes due in a managed service it’s often more expensive because the abstraction is deeper and the path to the underlying behavior is longer.

What This Means in Practice

The practical implication isn’t “understand everything before you use it.” That’s not possible and it’s not the point. The point is to be honest about where your understanding ends and to account for that in how you operate.

A positive verify step is one version of this. Don’t assume the thing worked. Confirm it. The cost of the confirmation is low. The cost of finding out it didn’t work after you’ve built several more layers on top of the assumption is high.

Runbooks that go one layer deeper than you expect to need are another version. If you’re operating a system built on a managed database, your runbook for “database is behaving unexpectedly” should go at least one level below the managed service interface. Not because you expect to operate at that level routinely. Because you need to know that you can if you have to.

Load testing failure modes, not just performance. Understanding what happens when the dependency you rely on is slow or unavailable, not just what happens when it’s healthy. Building operational familiarity with the tools and systems you depend on before you need to diagnose them under pressure.

None of this eliminates the abstraction tax. It lowers the effective rate by building understanding incrementally rather than all at once under the worst possible conditions.

The 2am Version

The clearest version of this principle is the 2am version.

Something is broken. The system is behaving in a way that doesn’t match any pattern you recognize. You have incomplete information and time pressure and the mental overhead of having been woken up. The question isn’t whether you can reason your way to the answer given enough time and a clear head. The question is whether the understanding exists in a form that’s accessible under those conditions.

Understanding that has to be reconstructed from first principles at 2am under pressure is not operational understanding. It’s theoretical understanding, which is better than nothing and worse than what you need.

The teams that operate well under those conditions aren’t the ones who never have incidents. They’re the ones who have built operational familiarity with their systems: who know where the bodies are buried, which parts behave unexpectedly under specific conditions, which assumptions the system makes that aren’t documented anywhere. That knowledge comes from operating the system, debugging it, running into its edges, and paying the abstraction tax early enough that it’s affordable.

You can’t shortcut that. The managed service doesn’t give it to you. The vendor documentation doesn’t give it to you. The team that built the system before you doesn’t give it to you. Not fully, not in the form you’ll need it.

You build it by operating. By adding the verify step. By going one level deeper than you had to. By paying the tax in small installments rather than all at once when you can least afford it.

The understanding either exists or it doesn’t. At 2am, you find out which one.

← back to all writing