May 17, 2026 · 7 min

The Cloud Didn't Simplify Infrastructure. It Redistributed the Complexity.

Patrick McClory

The cloud didn't make infrastructure simpler. It moved the complexity somewhere less visible and replaced some of it with operational surfaces you have to learn from scratch. The 'cloud is awesome' moment and the 'I have no idea how this actually works' moment are often the same capability viewed six months apart.

infrastructure platform-engineering operations engineering-culture

The pitch is compelling: move to the cloud and stop worrying about infrastructure. Managed services handle the hard parts. The networking is somebody else’s problem. The storage is abstracted away. The operational burden goes down and your team can focus on building things that matter.

Before I go further: I’m not against cloud. My career is built on it. I’ve run production systems at scale on AWS, I’ve designed architectures that wouldn’t have been possible without managed services, and I’d make the same calls again on most of them. The cloud is good at what it does.

What it doesn’t do is make infrastructure simple. It makes infrastructure different. Some of that difference is an improvement. Some of it is a trade you need to understand before you make it. The teams that get into trouble aren’t the ones who chose cloud. They’re the ones who treated “managed service” as synonymous with “understood” and discovered the gap at the worst possible moment.

Some of the pitch is true. The cloud removes real classes of operational complexity. You don’t provision physical hardware. You don’t think about MTU unless you go looking for it. Managed databases handle replication and failover in ways that would have required serious engineering effort a decade ago.

But the complexity didn’t disappear. It moved. And in some cases it got replaced by entirely new operational surfaces that are different, not simpler. The teams that don’t notice the difference are the ones who get surprised when something goes wrong.

What the Abstraction Hides

MTU is a useful example precisely because most cloud engineers have never thought about it.

MTU stands for maximum transmission unit: the largest packet that can cross a network path without being fragmented. In a physical network, this is something you configure, validate, and occasionally debug when things go wrong in specific ways. In a cloud environment, the underlying network infrastructure makes assumptions about MTU on your behalf and you never see them.

That abstraction is convenient. It removes a real operational concern. It also means that when you encounter a problem that traces back to MTU, and this happens in specific circumstances involving VPNs, tunneled connections, or hybrid architectures, you may not have the mental model to recognize it. The cloud abstracted away the problem and the understanding at the same time.
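To make the tunnel case concrete, here is a rough sketch of the arithmetic. It assumes a 1500-byte Ethernet MTU, IPv4 and TCP headers without options, and commonly cited encapsulation overheads for GRE and VXLAN over IPv4. Real overhead depends on the tunnel type and its configuration (IPsec in particular varies with cipher and mode), so treat the numbers as illustrative. The names `inner_mtu`, `tcp_mss`, and `fits` are hypothetical helpers, not part of any library.

```python
# Illustrative arithmetic only: the overhead values below are typical figures
# for IPv4 encapsulation and are assumptions, not measurements of any network.
ETHERNET_MTU = 1500
IPV4_HEADER = 20   # bytes, no options
TCP_HEADER = 20    # bytes, no options

TUNNEL_OVERHEAD = {
    "none": 0,
    "gre": 24,     # 20-byte outer IPv4 header + 4-byte GRE header
    "vxlan": 50,   # outer IPv4 + UDP + VXLAN header + inner Ethernet header
}


def inner_mtu(link_mtu: int, overhead: int) -> int:
    """Largest packet the tunnel can carry without exceeding the link MTU."""
    return link_mtu - overhead


def tcp_mss(mtu: int) -> int:
    """Largest TCP payload per packet for a given MTU (IPv4, no options)."""
    return mtu - IPV4_HEADER - TCP_HEADER


def fits(packet_size: int, link_mtu: int, overhead: int) -> bool:
    """True if a packet of this size crosses the tunnel without fragmentation."""
    return packet_size <= inner_mtu(link_mtu, overhead)


for name, overhead in TUNNEL_OVERHEAD.items():
    mtu = inner_mtu(ETHERNET_MTU, overhead)
    print(f"{name:6s} inner MTU={mtu} TCP MSS={tcp_mss(mtu)} "
          f"full-size packet fits={fits(ETHERNET_MTU, ETHERNET_MTU, overhead)}")
```

The point of the exercise is the last column: a host that still believes the path carries full 1500-byte packets will send packets the tunnel cannot carry unfragmented. If path MTU discovery is also blocked somewhere along the way, the usual symptom is small requests succeeding while large transfers stall.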

This is the double edge. The things the cloud hides from you are hidden in both directions. You don’t have to think about them until you do. And when you do, you’re starting from less context than someone who has had to manage those things directly.

This isn’t an argument against using managed services or cloud infrastructure. It’s an argument for knowing which assumptions you’re inheriting and what happens when those assumptions don’t hold.

The Operational Leap Problem

The other side of this is the operational leap, where the cloud capability is better but requires a different understanding to operate correctly.

Database backup is the clearest example I keep coming back to. Moving from traditional managed database dumps to archiving WAL (write-ahead log) files in object storage is a real improvement in almost every dimension. Continuous archiving. Point-in-time recovery. No maintenance windows for backup jobs. The capability is better.

It’s also a whole new world operationally. I’ve watched teams inherit WAL archiving as a “best practice” from a previous setup or a well-meaning handoff and operate it incorrectly for months without knowing. Not because the technology broke. Because the operational model is different from what came before and nobody built that understanding when the capability was adopted.

WAL archiving has its own failure modes. Restoring from WAL archives is a different process from restoring from a dump file. The archive chain has to be continuous. Gaps mean you can’t recover to points after the gap. The object storage bucket has to be configured correctly, the retention policies have to be right, and the archiver has to be running and healthy. When something goes wrong you need to understand how WAL archiving works, what breaks the chain, and how to recover.
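As one hedged illustration of what "running and healthy" could mean in practice, the sketch below evaluates a row from PostgreSQL's `pg_stat_archiver` view, assuming PostgreSQL is the database in question. The column names (`last_archived_time`, `last_failed_time`, `last_failed_wal`) come from that view. The function name `archiver_problems`, the dictionary input, and the 15-minute threshold are assumptions made for the example. On a quiet database with no `archive_timeout` set, a long gap between archived segments can be perfectly normal, so the threshold needs tuning to the workload.

```python
from datetime import datetime, timedelta, timezone


def archiver_problems(stats, now, max_lag=timedelta(minutes=15)):
    """Return human-readable problems found in one pg_stat_archiver row.

    `stats` is a dict keyed by pg_stat_archiver column names; fetching it
    from the database is left out to keep the sketch self-contained.
    `max_lag` is an assumed threshold, not a PostgreSQL default.
    """
    problems = []
    last_ok = stats.get("last_archived_time")
    last_fail = stats.get("last_failed_time")

    if last_ok is None:
        problems.append("no WAL segment has ever been archived")
    elif now - last_ok > max_lag:
        problems.append(f"last successful archive was {now - last_ok} ago")

    # A failure newer than the last success means the archiver is retrying.
    if last_fail is not None and (last_ok is None or last_fail > last_ok):
        problems.append(
            f"most recent archive attempt failed on {stats.get('last_failed_wal')}"
        )
    return problems


now = datetime(2026, 5, 17, 12, 0, tzinfo=timezone.utc)
stalled = {
    "last_archived_time": now - timedelta(hours=3),
    "last_failed_time": now - timedelta(minutes=1),
    "last_failed_wal": "00000001000000000000002A",
}
for problem in archiver_problems(stalled, now):
    print(problem)
```

A check like this only says whether the archiver is currently succeeding. It says nothing about whether the files already in the bucket form a complete, restorable chain.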

None of that knowledge transfers from “we run pg_dump every night.” And the dangerous part isn’t that it fails loudly. It’s that it can fail silently. An archive chain that’s been subtly incomplete for six months looks fine right up until you need to restore and discover the gap. At that point the backup you thought you had doesn’t exist in the form you need it.
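One way to make a silent gap visible is to check the archived segment names for continuity. The sketch below is an illustration, not a substitute for the verification built into backup tooling or for an actual test restore. It assumes PostgreSQL's 24-hex-digit segment file names, the default 16 MB segment size (256 segments per log number), and a single timeline. `segment_number` and `find_gaps` are hypothetical helper names, and the input is assumed to be a plain listing of object names from the archive bucket.

```python
import re

# WAL segment file names are 24 uppercase hex digits:
# 8 for the timeline, 8 for the log number, 8 for the segment within the log.
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def segment_number(name, segments_per_log=256):
    """Map a segment file name to a linear position within its timeline."""
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return log * segments_per_log + seg


def find_gaps(names, segments_per_log=256):
    """Return (last_before_gap, first_after_gap, missing_count) tuples.

    Non-segment files such as timeline history or backup label files are
    ignored. Only a single timeline is handled in this sketch.
    """
    segments = sorted(n for n in names if WAL_NAME.match(n))
    if len({n[:8] for n in segments}) > 1:
        raise ValueError("multiple timelines present; not handled here")

    gaps = []
    for prev, cur in zip(segments, segments[1:]):
        missing = (segment_number(cur, segments_per_log)
                   - segment_number(prev, segments_per_log) - 1)
        if missing > 0:
            gaps.append((prev, cur, missing))
    return gaps


listing = [
    "0000000100000000000000FE",
    "0000000100000000000000FF",
    "000000010000000100000000",  # rollover to the next log number, no gap
    "000000010000000100000003",  # segments ...01 and ...02 are missing
    "00000001.history",
]
for before, after, count in find_gaps(listing):
    print(f"{count} segment(s) missing between {before} and {after}")
```

Continuity of names is necessary but not sufficient. A complete listing can still contain truncated or corrupt files, which is why a periodic restore into a scratch environment remains the stronger test.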

The moment an organization moves to WAL archiving is often celebrated as an infrastructure improvement, and it is. What’s less visible is that the operational understanding required to actually use that backup capability in a recovery scenario is significantly more complex than what came before. The capability upgraded. The team’s operational familiarity with the new capability didn’t upgrade automatically.

This is the “cloud is awesome” moment and the “I have no idea how this actually works” moment occupying the same decision. You adopted the feature. You didn’t necessarily adopt the understanding of what to do when the feature fails.

Complexity Moves, It Doesn’t Disappear

The pattern across both of these examples is the same: cloud infrastructure doesn’t eliminate complexity, it redistributes it.

Some complexity gets abstracted away permanently and you don’t need to think about it. MTU in a standard cloud deployment is mostly in this category. The abstraction holds, the underlying behavior is consistent enough, and you can operate effectively without understanding what’s happening below the abstraction layer.

Some complexity gets abstracted away until it doesn’t. This is the MTU-in-a-hybrid-network scenario, or the managed database that behaves differently under specific load patterns, or the cloud networking construct that works fine until you add a VPN gateway and the path MTU changes. The abstraction was real and the abstraction broke. Now you need the understanding you never had to develop.

Some complexity gets replaced by new complexity. WAL archiving is better than dump files and it requires understanding a different operational model. Kubernetes is more capable than the deployment tooling it replaced and it has a learning surface that takes years to develop real depth in. Container networking is different from host networking in ways that matter under specific failure modes. The old complexity is gone. The new complexity is real.

The teams that navigate this well aren’t the ones who avoid the cloud or avoid managed services. They’re the ones who are honest about which category each abstraction falls into. They know which complexity they’ve shed and which complexity they’ve deferred or transformed. They build operational understanding of the capabilities they depend on rather than assuming the abstraction is the whole story.

The Question Worth Asking

Before adopting any cloud capability that abstracts away operational complexity, the question worth asking is: what does this look like when it fails?

Not as a reason to avoid the capability. As a forcing function for developing the operational understanding before you need it. WAL archiving is the right call. What does a broken archive chain look like and how do you recover from it? Managed Kubernetes is probably the right call for most teams. What does the control plane look like when something goes wrong and you need to understand it?

The question reveals which category the abstraction falls into. If you can answer it confidently, the abstraction is solid ground. If you can’t, you’ve adopted the capability without the understanding, which means you’re relying on the abstraction holding perfectly indefinitely. That’s not a reasonable operational posture.

The cloud makes some things easier. It makes some things different in ways that require new understanding. It makes some things invisible in ways that are fine until they aren’t. Knowing which is which is the actual infrastructure competence. The rest is just using the tools.

What This Means for How You Build

The practical implication is that “we use managed services” is not an infrastructure strategy. It’s a procurement decision. The strategy is what you understand about how those services work, what you do when they don’t, and which parts of the operational surface you’ve internalized versus assumed away.

Every managed service you adopt is a tradeoff between operational burden and operational control. You give up some control in exchange for having someone else manage the complexity. That’s often a good trade. It’s a trade you should make consciously, with a clear view of what you’re giving up and what you’re getting.

The infrastructure teams that are perpetually surprised when things go wrong in cloud environments are usually the ones who adopted capabilities without adopting understanding. The “cloud is awesome” moment was real. The operational model that follows from it wasn’t built.

The question isn’t where you run. It’s what you understand about where you run. That’s the same question on bare metal and in a managed cloud environment. The abstraction layers are different. The requirement for genuine operational understanding isn’t.
