March 18, 2026 · 6 min

What Done Looks Like in Infrastructure Work

Patrick McClory

Infrastructure work has a done problem that software delivery has mostly solved, and infrastructure hasn't caught up. 'It's working' is not done. 'Nobody is complaining' is not done. And the cost of not defining done is that you never actually finish anything. You just stop actively working on it.

platform-engineering infrastructure operations automation

Software delivery teams have largely solved the done problem. Acceptance criteria. Definition of done. Automated tests that tell you when you’ve arrived. It’s not perfect, but the discipline exists and most teams that care about quality apply it.

Infrastructure work hasn’t caught up.

“It’s working” is not done. “Nobody is complaining” is not done. “The health checks are green” is not done. These are signals that the system is currently running, which is a different and smaller claim than done. The cost of confusing them is that you never actually finish anything. You just stop actively working on it, which is not the same thing at all.

The Health Check Problem

I have an application running right now that passes all health checks and readiness checks in a new environment. The logs are quiet. From every instrument the system exposes, it looks correct post-migration. It is functionally broken. Silently. The developer responsible for it is still struggling to explain the failure mode.

This is not a rare situation. It’s the normal outcome of a done definition that stops at “is the service running.”

Health checks verify liveness: is the process alive? Readiness checks verify that the service is ready to receive traffic: is it past its startup phase, and has it connected to its dependencies? Neither one checks whether the application is actually doing what it’s supposed to do. Liveness, readiness, and correctness are three different things, and most teams only wire up the first two.

You can have a perfectly green status page and a completely broken application. The status page is not lying. The process is alive and it’s accepting traffic. It just isn’t processing that traffic correctly, or it’s losing data, or it’s returning results that are technically valid and functionally wrong. The health check doesn’t know. It wasn’t asked.
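To make the distinction concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `OrderService`, its `tax_rate` setting, and the expected total of 108.00 are invented for illustration and do not describe the application mentioned above. The point is only that the first two checks can pass while the third fails.

```python
from dataclasses import dataclass


@dataclass
class OrderService:
    """Hypothetical stand-in for a migrated service (illustration only)."""
    process_alive: bool = True
    dependencies_connected: bool = True
    # Assumed example of a config value that silently fell back to a default
    # in the new environment.
    tax_rate: float = 0.0

    def total(self, subtotal: float) -> float:
        return round(subtotal * (1 + self.tax_rate), 2)


def liveness(svc: OrderService) -> bool:
    # Is the process alive?
    return svc.process_alive


def readiness(svc: OrderService) -> bool:
    # Is it past startup and connected to its dependencies?
    return svc.process_alive and svc.dependencies_connected


def functional(svc: OrderService) -> bool:
    # Does a known input produce the known-correct output?
    return svc.total(100.00) == 108.00


svc = OrderService()
results = {
    "liveness": liveness(svc),
    "readiness": readiness(svc),
    "functional": functional(svc),
}
print(results)
```

The first two checks report green because nothing they look at is wrong. Only the check that asks for a specific, known-correct result notices the problem.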

This is what happens when done means “instruments show green” rather than “the system does what it’s supposed to do.”

Running vs Working

The gap between running and working is where most infrastructure debt actually lives.

A service that’s running has a process, consumes resources, and responds to health checks. A service that’s working does what it was designed to do, produces the correct outputs, handles failure modes gracefully, and degrades in expected ways under load. Running is a necessary condition for working. It’s not sufficient.

Most infrastructure monitoring is built to detect the running state. Uptime. Process health. Resource consumption. Network connectivity. These are all legitimate things to monitor. They tell you the system is alive. They don’t tell you the system is doing its job.

The gap shows up most clearly during migrations and environment changes. A service migrated to a new environment can pass every automated check (containers healthy, dependencies reachable, endpoints responding) and still be operating differently than it was before. Configuration values that worked in the old environment don’t work in the new one. Network paths that were assumed are different. A dependency that was adjacent is now a hop further away and the timeouts are wrong. None of this shows up in liveness or readiness checks. It shows up when someone tries to actually use the thing.
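The timeout case is one of the few that can be checked mechanically. The sketch below assumes you have per-dependency timeouts from configuration and some measured p99 latency for each dependency in each environment. The function name, the 2x headroom factor, and all of the numbers are made up for illustration, not measurements from a real system.

```python
from typing import Dict, List


def timeouts_without_headroom(
    timeouts_ms: Dict[str, float],
    observed_p99_ms: Dict[str, float],
    headroom: float = 2.0,
) -> List[str]:
    """Return dependencies whose timeout is below headroom * observed p99,
    or for which there is no measurement at all."""
    flagged = []
    for dep, timeout in sorted(timeouts_ms.items()):
        p99 = observed_p99_ms.get(dep)
        if p99 is None or timeout < headroom * p99:
            flagged.append(dep)
    return flagged


# Illustrative values only: the same config, measured in two environments.
timeouts = {"db": 50.0, "cache": 20.0}
old_env_p99 = {"db": 10.0, "cache": 2.0}
new_env_p99 = {"db": 35.0, "cache": 3.0}  # the db is now a hop further away

flagged_old = timeouts_without_headroom(timeouts, old_env_p99)
flagged_new = timeouts_without_headroom(timeouts, new_env_p99)
print("old environment:", flagged_old)
print("new environment:", flagged_new)
```

The configuration did not change between the two runs. The environment did, and a check like this is one way to notice that a value that used to be safe no longer is. A dependency with no measurement is flagged as well, on the assumption that unknown is not the same as fine.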

The discipline that catches this is functional verification: testing not just that the system is running but that it’s working correctly. In software, this is called integration testing or end-to-end testing and it’s table stakes for a production deployment. In infrastructure, it’s often skipped because it’s harder to define and harder to automate. The result is environments that look done and aren’t.

What Done Actually Requires

Done in infrastructure has three layers and most teams only verify the first one.

The first layer is structural: is the system running, are the components healthy, are the dependencies reachable. This is what health checks and readiness probes cover. It’s necessary and it’s not enough.

The second layer is functional: does the system do what it’s supposed to do. For a network, that means traffic flows correctly between segments, routing decisions are what the design intended, and the firewall rules actually protect what they’re supposed to protect. For a storage cluster, that means data written can be read back, replication is happening, and the failure domain boundaries work as designed. For a deployment pipeline, that means a commit actually produces a running workload, not just a successful pipeline run.
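For the storage example, the smallest possible functional check is a round trip: write known bytes, read them back, compare. The sketch below assumes the storage is exposed as a mounted directory. The function name and the probe file name are hypothetical, and a temporary directory stands in for the real mount point.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def write_read_roundtrip(mount_point: Path, size: int = 4096) -> bool:
    """Write random bytes under mount_point, read them back, compare digests."""
    payload = os.urandom(size)
    expected = hashlib.sha256(payload).hexdigest()
    probe = mount_point / "done-check.probe"  # hypothetical probe file name
    try:
        probe.write_bytes(payload)
        actual = hashlib.sha256(probe.read_bytes()).hexdigest()
    finally:
        # Clean up whether or not the round trip succeeded.
        probe.unlink(missing_ok=True)
    return actual == expected


# A temporary directory stands in for the real mount point here.
with tempfile.TemporaryDirectory() as tmp:
    roundtrip_ok = write_read_roundtrip(Path(tmp))
    leftover = list(Path(tmp).iterdir())

print("round trip ok:", roundtrip_ok)
```

This covers only the first of the three storage claims. A read immediately after a write may be served from a local cache, and nothing here says anything about replication or failure domain boundaries. Those need their own checks, and they belong on the same list.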

The third layer is operational: can the people responsible for the system understand what it’s doing, respond to failures, and recover from known failure modes within acceptable time. A system that works when everything is fine but requires heroics when something goes wrong is not done.

Most infrastructure done definitions stop at layer one. The good ones get to layer two. The teams that operate well over time build toward layer three.

Done Is a Named List

The practical implication of all this is simple, even if executing on it is work.

Done is a named, testable list of criteria that someone else could verify independently. Not a feeling. Not “it seems fine.” A list.

For the network bootstrap I described elsewhere: all devices up and reachable, VLAN 1 accessible on all switches, SSH confirmed to each device, IPMI confirmed on the router. That’s a list. Someone else could run through it. When every item was checked, I left. Not before.

For a migration: the service passes health and readiness checks in the new environment, the service processes a real request end-to-end and produces the correct output, the logs show what you’d expect to see from a healthy service, and the failure modes that were handled in the old environment are handled the same way in the new one.
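A list like the migration one can be written down as data, so that someone else can run it. This is a sketch of one possible shape, not a prescribed tool: `Criterion` and `evaluate` are hypothetical names, and the lambdas are placeholders where real checks would go.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Criterion:
    name: str
    check: Callable[[], bool]


def evaluate(criteria: List[Criterion]) -> Tuple[bool, List[Tuple[str, bool]]]:
    """Run every criterion and return (done, per-criterion results)."""
    results = []
    for criterion in criteria:
        try:
            passed = bool(criterion.check())
        except Exception:
            passed = False  # a check that cannot run is not a pass
        results.append((criterion.name, passed))
    # An empty list is not a done definition, so it never counts as done.
    done = bool(results) and all(passed for _, passed in results)
    return done, results


# Placeholder checks: each lambda stands in for a real verification step.
migration_done = [
    Criterion("health and readiness checks pass in the new environment", lambda: True),
    Criterion("a real request is processed end-to-end with correct output", lambda: False),
    Criterion("logs show what a healthy service would produce", lambda: True),
    Criterion("failure modes are handled as in the old environment", lambda: True),
]

done, report = evaluate(migration_done)
for name, passed in report:
    print(f"[{'x' if passed else ' '}] {name}")
print("done" if done else "not done")
```

With the placeholder results above, three of the four items pass and the migration is still not done. The design choice that matters is that every criterion has a name, so the report says which item failed and not just that something did.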

If you can’t write the list in ten minutes, you don’t know what done is yet. That’s not a criticism. It’s information. It means the done definition needs to be built before the work starts, not discovered after the system is supposedly running.

Why This Matters Beyond Checklists

The done definition is not bureaucratic overhead. It’s the thing that tells you when you can trust the system.

Infrastructure that is running but not verified as working accumulates risk silently. Every day that passes is another day of unknown state: unknown whether the failure mode you didn’t check for is present, unknown whether the configuration drift that happened during migration has introduced subtle wrongness, unknown whether the system will behave correctly under the conditions that matter most.

The app I mentioned at the start of this post is running in production right now, passing its health checks, processing traffic, and failing silently for a class of requests that the checks don’t cover. The team knows there’s a problem. They don’t fully understand it yet. The cost of that, in engineering time, in user impact, in the cognitive overhead of operating a system you don’t trust, is real and it’s accumulating.

That cost was avoidable. Not by being more careful in general, but by having a done definition that included functional verification before the migration was called complete. The health checks passing was never going to be enough. Someone needed to say so before the deployment, not after.

Done is a decision you make before the work starts. It’s a specific picture of what you’re trying to produce that’s clear enough to test against. It’s the thing that lets you say, with confidence, that the work is actually finished.

Everything before that is still in progress. Even if the health checks are green.
