April 11, 2026 · 6 min

The Operational Surface Is the Cost Nobody Counts

Patrick McClory

Most architecture evaluations compare tools in isolation. The better question is: what's the right tool given the operational surface I'm already committed to? Adding a new tool has a real cost that almost never shows up in the analysis. Reusing what's already there has a real value that almost never gets counted.

platform-engineering infrastructure engineering-culture operations


There’s a concept called girl math, the logic that explains how store credit is essentially free because you already paid for it, or how buying something on sale at 40% off means you saved money even if you spent money. I say this as someone who completely understands the appeal: when you’ve already committed to a cost, anything that comes along for the ride registers differently than something new.

Infrastructure math follows the same logic, and in the best cases it’s even more favorable. Not buy-one-get-one. More like deploy-one-get-three.

The Ceph RGW decision is the example I keep coming back to.

The Reasoning Chain

I needed object storage for the platform. Document storage for Open WebUI, backup target for CloudNativePG (CNPG) via the Barman Cloud plugin, local backup storage generally. S3-compatible, reasonably performant, something I could trust.

The evaluation went roughly in this order:

MinIO is cool. I’ve used it, I like it, the developer experience is good. But their licensing got weird a few years back, with the move from Apache 2.0 to AGPLv3, and for a production setup I want something I’m not going to have to reevaluate every time their terms shift.

MinIO on top of Ceph would be absurd. You’d be running an S3-compatible object storage layer on top of a distributed storage system that already has a native S3-compatible object storage interface. That’s not layering for capability. That’s layering for familiarity.

Ceph RGW is right there. It’s deployable via the Rook-Ceph Helm chart I’m already using. The primary reason I’m deploying Ceph at all is storage clustering: RBD for block, CephFS for shared filesystem. RGW for object storage is a component of the same system. I’ve been running RGW in production in a few places and I’m confident in it operationally.

The thing that sealed it: CNPG and the Barman Cloud backup plugin need somewhere to put backups. That’s the primary justification for deploying RGW. Everything else, the document storage for Open WebUI, the general backup target, comes along for the ride. That’s not a compromise. That’s infrastructure math.

It’s basically free. Not free in the sense of zero cost or zero configuration. Free in the sense that the operational surface is already committed. I’m already running Ceph. I’m already responsible for understanding it, monitoring it, upgrading it, debugging it. RGW doesn’t add a new surface. It adds capability to an existing one.

What Operational Surface Actually Costs

The phrase I keep using is operational surface, and it’s worth defining clearly because it’s the thing that almost never shows up in architecture evaluations.

Operational surface is everything you’re responsible for understanding about a system in production. Not just “does it run” but: how does it fail, what do the logs look like when something is wrong, what does the upgrade path look like, what are the edge cases that bite you at scale, what does the person who inherits this need to know to operate it safely.

Every tool you add to your stack adds to that surface. Some additions are worth it because the capability they provide justifies the operational cost. A lot of additions happen because the evaluation compared tools in isolation rather than accounting for what you’re already carrying.

When you evaluate MinIO against Ceph RGW in isolation, MinIO might win on developer experience and documentation quality. When you evaluate them in the context of a stack that already includes Rook-Ceph, the calculus changes entirely. MinIO adds a new operational surface. RGW extends an existing one. Those are not equivalent costs.

This shows up everywhere. The team that adds a third monitoring tool because it has a slightly better dashboard for one specific use case is probably not accounting for the cost of now running three monitoring stacks. The team that deploys a separate message queue for a use case that their existing database could handle isn’t counting the operational surface of the new queue. The team that reaches for a managed service when they already have the capability in-house isn’t counting the integration surface of the external dependency.

None of these are necessarily wrong decisions. They’re decisions that are often made without counting the full cost.

The Stack Already Has Opinions

The other thing worth saying about the RGW decision: the stack I’ve committed to already has strong opinions about how object storage should work.

Rook-Ceph is the operator for the whole storage layer. It manages the OSDs, the CephFS filesystems, and the RBD pools. Adding RGW to what Rook manages is one CephObjectStore custom resource. The operator already knows how to deploy it, how to scale it, how to handle failures. The monitoring integration with the existing Prometheus stack is already there. The backup and restore procedures for Ceph cover RGW along with everything else.
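
For a sense of scale, here is a minimal sketch of what that one resource could look like, built as a Python dict and printed as JSON (which kubectl accepts as well as YAML). The store name, namespace, pool sizes, and gateway count are assumptions for illustration, not the configuration running on this platform, where the equivalent lives in Helm chart values rather than a standalone manifest.

```python
import json


def object_store_manifest(name="platform-store", namespace="rook-ceph",
                          replicas=3, gateways=2):
    """Return a CephObjectStore manifest as a plain dict.

    Every default here is a hypothetical value chosen for the example.
    """
    return {
        "apiVersion": "ceph.rook.io/v1",
        "kind": "CephObjectStore",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Bucket indexes and other RGW metadata.
            "metadataPool": {
                "failureDomain": "host",
                "replicated": {"size": replicas},
            },
            # The object data itself.
            "dataPool": {
                "failureDomain": "host",
                "replicated": {"size": replicas},
            },
            # Keep the pools if the object store resource is deleted.
            "preservePoolsOnDelete": True,
            # The RGW daemons that serve the S3-compatible API.
            "gateway": {"port": 80, "instances": gateways},
        },
    }


if __name__ == "__main__":
    print(json.dumps(object_store_manifest(), indent=2))
```

The point is the size of the thing: a few dozen lines of declarative config handed to an operator that is already running, not a new system to stand up.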

This is what I mean by the stack having opinions. When you’ve made a foundational choice, subsequent decisions that work with that foundation rather than around it inherit a lot of value for free. The tooling, the operational knowledge, the runbooks, the monitoring. All of it extends to the new capability without additional investment.

The inverse is also true. Decisions that work around the foundation rather than with it have to build all of that from scratch. MinIO would need its own monitoring integration, its own backup procedures, its own operational runbooks. Not because MinIO is bad. Because it’s a different system with a different operational model.

How to Actually Count the Cost

The practical version of this: before adding anything new to a production stack, ask what it would cost to operate independently versus what it would cost to extend what’s already there.

Independent operation cost: new monitoring integration, new runbook, new upgrade procedure, new failure mode to understand, new thing to explain to whoever inherits the system. These aren’t hypothetical. Every one of them takes real time when something goes wrong or when the system needs to change.

Extension cost: how much of the existing operational investment carries over? If the monitoring is already there, if the backup procedures cover it, if the upgrade path is managed by an operator you’re already running, the extension cost might be close to zero.

The gap between those two costs is the real value of choosing the tool that fits the stack you have rather than the tool that wins the isolated evaluation.
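
One way to make that comparison concrete is a short Python sketch. It treats the five independent-operation costs listed above as a checklist and counts how many each option would force you to build from scratch. The `Option` class, the `surface_gap` helper, and the choice of which items RGW inherits are all illustrative assumptions, and weighting every item equally is a simplification, not a measurement.

```python
from dataclasses import dataclass

# The five independent-operation costs named in the text.
SURFACE_ITEMS = (
    "monitoring",
    "runbook",
    "upgrade procedure",
    "failure modes",
    "handoff knowledge",
)


@dataclass(frozen=True)
class Option:
    name: str
    # Items the existing stack already covers for this option.
    inherited: frozenset

    def new_surface(self):
        """Items this option would require building from scratch."""
        return [item for item in SURFACE_ITEMS if item not in self.inherited]


def surface_gap(independent, extension):
    """How many more new items the independent tool adds than the extension."""
    return len(independent.new_surface()) - len(extension.new_surface())


# Assumption: a standalone tool inherits nothing from the existing stack.
minio = Option("MinIO", frozenset())
# Assumption: RGW inherits everything except its own failure modes,
# which still have to be learned even on a familiar system.
rgw = Option("Ceph RGW", frozenset(
    {"monitoring", "runbook", "upgrade procedure", "handoff knowledge"}
))

if __name__ == "__main__":
    for option in (minio, rgw):
        print(option.name, "adds:", option.new_surface())
    print("gap:", surface_gap(minio, rgw))
```

A checklist like this will not make the decision for you, but it forces the isolated evaluation to show the items it left out.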

This isn’t an argument for never adding new tools. It’s an argument for counting what you’re actually adding when you do. Sometimes the new tool is worth the surface cost. Sometimes the capability it provides can’t be replicated by what you have. Those are real cases and the right answer in those cases is to add the tool and accept the cost.

But a lot of the time, the tool that fits the stack is there and the evaluation missed it because it was comparing capabilities in isolation rather than accounting for the operational surface that comes along with each option.

The Infrastructure Math Version

Back to girl math.

Store credit is free because you already paid for it. The value is real even if the original payment was also real. The accounting that makes it feel free is legitimate. You’re not spending new money, you’re deploying already-committed value.

RGW is even better than store credit because the original payment keeps buying things it was never made for. I deployed Ceph for storage clustering. I get object storage as part of the same operational commitment. The CNPG backup use case justified RGW. Open WebUI document storage came along for the ride. Future MCP server workloads that need S3-compatible storage will use it too.

That’s not one thing. That’s at least four things out of one operational surface.

The cost I didn’t add: a separate object storage system to understand, monitor, upgrade, and debug. The cost I did add: one CephObjectStore custom resource in a Helm chart for a system I’m already running.

I’ll take that trade every time. Not because I’m trying to minimize the stack for its own sake, but because every tool I don’t add is a failure mode I don’t have to debug at 2am, a runbook I don’t have to write, a thing I don’t have to explain to the person who inherits this.

The operational surface is the cost. Count it.
