May 5, 2026 · 7 min
Idempotency Is a Promise
Patrick McClory
Tools offer idempotency as a good intention. Systems need it as a guarantee. Knowing when to step in and upgrade one to the other is what separates automation that works from automation that works until it doesn't.
“I ran it twice and nothing broke.”
That’s the test most engineers use for idempotency. Run the thing, run it again, confirm nothing exploded. Ship it. And it works, right up until it doesn’t, and then nobody can figure out why because the automation passed the only test anyone ever ran against it.
Idempotency means running an operation N times produces exactly the same result as running it once. Not “nothing visibly broke.” Not “it exited zero.” The same result. Same state. Same side effects. Same world after execution, regardless of how many times you execute.
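A minimal sketch of what a stricter test could look like, assuming the state you care about fits in a plain dictionary. The helper `is_idempotent` and the two sample operations are hypothetical names made up for illustration; the check compares resulting state rather than exit status, and it cannot see side effects that live outside the dictionary.

```python
import copy


def is_idempotent(operation, initial_state, runs=3):
    """Compare the state after one run with the state after several runs."""
    once = copy.deepcopy(initial_state)
    operation(once)

    many = copy.deepcopy(initial_state)
    for _ in range(runs):
        operation(many)

    return once == many


def ensure_address(state):
    # Idempotent: adds the entry only when it is missing.
    if "10.0.0.1" not in state["addresses"]:
        state["addresses"].append("10.0.0.1")


def append_address(state):
    # Not idempotent: never raises, but the state grows on every run.
    state["addresses"].append("10.0.0.1")


print(is_idempotent(ensure_address, {"addresses": []}))
print(is_idempotent(append_address, {"addresses": []}))
```

The second operation would pass an "I ran it twice and nothing broke" check, since it never fails. It only shows up as broken when the resulting states are compared.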
The gap between “nothing broke” and “same result” is where most automation debt lives.
When the Tool’s Promise Has Conditions
I’ve written elsewhere about an Ansible handle_absent_entries incident that cost me an afternoon and most of my confidence in a config I’d been running for weeks. The short version: a task that was perfectly idempotent on full runs became destructive when the input data changed. Every address not explicitly listed got removed from the router. The task did exactly what it was told. The problem was that “exactly what it was told” changed meaning when the data was incomplete.
That’s the first flavor of broken idempotency. The tool keeps its promise, but the promise has conditions you didn’t read carefully enough. Ansible will converge your state. It just converges toward whatever you hand it, and if what you hand it is a partial list, it converges toward a partial system.
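To make the failure mode concrete, here is a toy model of convergence with removal of unlisted entries. It is not Ansible's implementation; `converge` and its `remove_absent` flag are hypothetical stand-ins for "remove whatever is not in the list" behavior, and the addresses are made up.

```python
def converge(actual, desired, remove_absent):
    """Return the address set after converging `actual` toward `desired`."""
    result = set(actual) | set(desired)
    if remove_absent:
        # Anything not explicitly listed is treated as unwanted.
        result &= set(desired)
    return result


router = {"10.0.0.1", "10.0.0.2", "10.0.0.3"}
full_list = {"10.0.0.1", "10.0.0.2", "10.0.0.3"}
partial_list = {"10.0.0.1"}

# Full input: nothing changes, however often it runs.
after_full = converge(router, full_list, remove_absent=True)

# Partial input: still perfectly idempotent, and destructive.
after_partial = converge(router, partial_list, remove_absent=True)
after_partial_again = converge(after_partial, partial_list, remove_absent=True)
```

In this model the second run with the partial list changes nothing, so a run-it-twice test passes. The damage was done on the first run, when the input stopped describing the whole system.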
I’m not going to retell that story here. But it set the tone for everything that followed.
Building Idempotency the API Won’t Give You
MaaS (Canonical’s Metal as a Service, which provisions the bare-metal nodes here) is a consistent actor. It does what you ask. It validates what it can. You can’t deploy a machine that’s already deployed. You can’t modify network interfaces on a node that’s in a deployed state. The CLI has opinions, and those opinions are correct.
What it doesn’t do is reconcile. There’s no “make it look like this” command for bonds, addresses, boot disk selection, or VLAN sub-interfaces. There’s “create,” “update,” and “read.” The rest is your problem.
I knew going in that the right pattern was read-eval-update-validate. Read the current state from the API. Evaluate whether it matches the desired state. Update only what’s diverged. Validate the result. I’d hoped, somewhat irrationally, that the MaaS CLI would give me enough built-in validation to skip some of the ceremony. It didn’t. Not because MaaS is deficient. Because reconciliation isn’t MaaS’s job.
Here’s what it actually looks like. Bond creation on a bare-metal node:
- Read all interfaces from the API. Filter to physical interfaces matching the target bond speed. Identify the PXE interface and exclude it. Check whether a bond with the target name already exists.
- If the bond exists, check whether its params have diverged: bond mode, LACP rate, transmit hash policy, MII monitoring interval. Check whether any candidate member interfaces are missing from the bond.
- If the bond doesn’t exist, create it. If it exists but params diverged, update them. If members are missing, add them. If everything matches, do nothing.
- Re-read interfaces from the API. Confirm the bond is present with correct params. Fail explicitly if it isn’t.
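The four steps above can be sketched in Python against an in-memory stand-in. This is an illustration of the pattern, not the playbook itself and not the MaaS client: `FakeApi`, `write_bond`, `ensure_bond`, and the parameter keys are all assumptions chosen for the example.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class FakeApi:
    """In-memory stand-in for a machine API (hypothetical, not the MaaS client)."""
    interfaces: dict = field(default_factory=dict)  # name -> attributes
    writes: int = 0

    def read_interfaces(self):
        return copy.deepcopy(self.interfaces)

    def write_bond(self, name, members, params):
        # Serves as both "create" and "update" in this fake.
        self.interfaces[name] = {
            "type": "bond", "members": sorted(members), "params": dict(params),
        }
        self.writes += 1


def ensure_bond(api, name, speed, pxe, params):
    # Read: current interfaces, candidate members, existing bond (if any).
    current = api.read_interfaces()
    candidates = {
        n for n, a in current.items()
        if a["type"] == "physical" and a["speed"] == speed and n != pxe
    }
    bond = current.get(name)

    # Evaluate, then update only what has diverged.
    if bond is None:
        api.write_bond(name, candidates, params)
        action = "created"
    elif bond["params"] != params or candidates - set(bond["members"]):
        api.write_bond(name, candidates | set(bond["members"]), params)
        action = "updated"
    else:
        action = "unchanged"

    # Validate: re-read and fail explicitly if the bond is not as required.
    after = api.read_interfaces().get(name)
    if after is None or after["params"] != params or candidates - set(after["members"]):
        raise RuntimeError(f"bond {name!r} did not converge")
    return action


api = FakeApi({
    "eth0": {"type": "physical", "speed": 10000},  # PXE interface, excluded
    "eth1": {"type": "physical", "speed": 10000},
    "eth2": {"type": "physical", "speed": 10000},
})
params = {"bond_mode": "802.3ad", "bond_miimon": 100}  # illustrative keys
first = ensure_bond(api, "bond0", 10000, "eth0", params)
second = ensure_bond(api, "bond0", 10000, "eth0", params)
```

In this sketch the first call creates the bond and the second performs no write at all. The write counter on the fake is what makes "do nothing" observable.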
That’s one resource type. Address assignment is the same pattern with an additional wrinkle: MaaS doesn’t support in-place link mode changes. If an interface has a link with the wrong mode or wrong IP, you have to unlink it and relink it. “Update” isn’t an operation the API offers for this resource. You tear it down and rebuild it, and you do that inside the same idempotent wrapper that skips the whole sequence if the link is already correct.
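A sketch of that unlink-and-relink wrapper, again against an in-memory fake. `FakeLinks`, its method names, and `ensure_link` are hypothetical and do not mirror the MaaS API; the point is only that the teardown happens inside a check that skips it when the link is already correct.

```python
class FakeLinks:
    """In-memory stand-in; method names are hypothetical, not the MaaS API."""

    def __init__(self, links):
        self.links = {i: [dict(l) for l in ls] for i, ls in links.items()}
        self.calls = []

    def read_links(self, iface):
        return [dict(l) for l in self.links.get(iface, [])]

    def unlink(self, iface, link):
        self.links[iface].remove(link)
        self.calls.append("unlink")

    def link(self, iface, link):
        self.links.setdefault(iface, []).append(dict(link))
        self.calls.append("link")


def ensure_link(api, iface, mode, ip):
    wanted = {"mode": mode, "ip": ip}

    # Read and evaluate: skip the whole sequence if already correct.
    existing = api.read_links(iface)
    if existing == [wanted]:
        return "unchanged"

    # No in-place update in this model: tear down, then rebuild.
    for link in existing:
        api.unlink(iface, link)
    api.link(iface, wanted)

    # Validate against a fresh read.
    if api.read_links(iface) != [wanted]:
        raise RuntimeError(f"link on {iface!r} did not converge")
    return "relinked"


links = FakeLinks({"eth1": [{"mode": "dhcp", "ip": None}]})
first_link = ensure_link(links, "eth1", "static", "10.0.0.5")
second_link = ensure_link(links, "eth1", "static", "10.0.0.5")
```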
Boot disk selection. Storage layout configuration. VLAN sub-interface creation. Every one of them follows the same read-eval-update-validate cycle because none of them offer a declarative “make it so” interface.
The code isn’t complicated. Each individual task is a straightforward API call wrapped in a conditional. But the aggregate is significant: dozens of tasks across six files, all implementing the same pattern, all doing the work that a declarative reconciliation engine would do for you if one existed for this API.
This is the part that doesn’t show up in architecture diagrams. Nobody draws a box labeled “hand-built idempotency layer.” But it’s real, it’s code, and if you don’t build it, your automation is a series of imperative commands that happen to work when you run them in the right order on a clean system.
State Machines Don’t Rewind
The MaaS deployment lifecycle is a state machine. A node starts at Ready. You trigger a deploy. It moves to Deploying. It installs the OS, reboots, and arrives at Deployed. If something goes wrong, it lands in Failed deployment.
The same Ansible playbook has to handle all of those entry states. If the node is Ready, trigger the deploy and wait. If it’s Deploying, skip the trigger and just wait. If it’s Deployed, skip everything. If it’s anything else, fail explicitly with a message that says “this node is in a state I don’t handle, go fix it in MaaS and come back.”
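The routing described above is small enough to write down directly. This is a sketch in Python rather than the playbook's own tasks; `plan_deploy` and the action names are hypothetical, while the status strings are the ones used in this post.

```python
def plan_deploy(status):
    """Return the actions to take for a node in the given status."""
    routes = {
        "Ready": ["trigger_deploy", "wait_for_deployed"],
        "Deploying": ["wait_for_deployed"],
        "Deployed": [],
    }
    if status not in routes:
        # Anything else is a state this automation does not handle.
        raise RuntimeError(
            f"node is in state {status!r}, which this automation does not "
            "handle; fix it in MaaS and come back"
        )
    return routes[status]
```

Unknown states, including Failed deployment, fall through to the explicit failure instead of being retried.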
This is where the “I ran it twice” model breaks down completely. The second run doesn’t start where the first run started. It starts wherever the first run left the system. And some of those places don’t have a backward path. A node in Failed deployment doesn’t rewind to Ready when you run the playbook again. A node that’s been deployed can’t have its storage layout reconfigured. The state moved forward and the old state is gone.
State machines are the ultimate forcing function for roll-forward thinking. You can’t undo a deployment. If commissioning discovers bad drives and updates the hardware inventory, you can’t uncommission the node back to the state it was in before. The system learned something. It changed. The only direction is forward, and your automation has to be built for that reality.
The idempotency here isn’t “run it again and get the same result.” It’s “run it again and get the correct behavior for whatever state the system is actually in right now.” That’s a harder promise to keep, because you have to understand the system’s own state model well enough to know what “correct” means at every point in the lifecycle. You have to know which transitions are reversible and which ones aren’t. You have to know when “try again” is the right answer and when “investigate, fix, then move forward” is the only answer.
Once I’d committed to read-eval-update-validate on the API resources, extending that same discipline to the deploy state machine was a natural step. Read the current state. Route based on what you find. Act only if action is appropriate for where you are. Validate the outcome. The deploy automation has never produced a surprise. Not because deployment is simple. Because the pattern was already in place before we got there, and the pattern assumes the world has changed since the last time you looked.
Good Intentions and Ironclad Promises
Terraform’s idempotency model is built on a state file. The state file records what Terraform believes the world looks like. On the next run, it diffs desired state against that record, which by default it refreshes only for the resources it already tracks, and applies the delta. It works well when Terraform is the only actor. When the system has mutation paths outside Terraform’s view, the state file drifts from reality. The diff is computed against a fiction.
That’s why we rejected Terraform for MaaS configuration management. Not because Terraform is a bad tool. Because the requirements included mutation paths that Terraform’s state model can’t track. MaaS nodes get commissioned through the UI. Operators adjust settings through the CLI during debugging. The MaaS commissioning process itself modifies machine state in ways that no external state file can predict. Terraform’s idempotency promise has a condition: “as long as I’m the only one making changes.” That condition doesn’t hold here.
The Ansible approach with hand-built read-eval-update-validate doesn’t have that condition. Every run reads actual state from the API. There’s no cached belief about what the world looks like. The reconciliation is real-time, every time. It’s more code. It’s more work to build and more work to maintain. And it’s correct under conditions where Terraform’s promise would silently break.
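The difference between a cached belief and a live read can be shown with a deliberately simplified model. This is not how Terraform computes a plan; `plan`, the setting names, and the values are hypothetical, and the only thing that varies is what the diff is computed against.

```python
def plan(desired, believed):
    """Return the settings that must change to move `believed` to `desired`."""
    return {k: v for k, v in desired.items() if believed.get(k) != v}


desired = {"hostname": "node-1", "power_type": "ipmi"}

# What the last run wrote down about the world.
recorded = {"hostname": "node-1", "power_type": "ipmi"}

# What the world looks like after an operator changed it out of band.
actual = {"hostname": "node-1", "power_type": "manual"}

from_record = plan(desired, recorded)  # diff against the cached belief
from_live = plan(desired, actual)      # diff against a fresh read
```

Against the record there is nothing to do. Against the live read the drift is visible and gets corrected, which is the whole argument for paying for a read on every run.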
This wasn’t a dramatic decision. It was the natural consequence of evaluating the requirements against what each tool actually guarantees versus what it intends.
The Upgrade
Every tool in the automation stack offers some version of idempotency. Ansible converges. Terraform reconciles. MaaS validates. All good intentions. None of them lies about what it does.
But “what they do” and “what your system needs” are different questions. Ansible converges toward whatever data you give it, including incomplete data. Terraform reconciles against a state file that can drift. MaaS validates individual operations but doesn’t reconcile across them. State machines move forward whether you’re ready or not.
Sometimes the good intention is enough. A lot of automation runs on it and never hits the edge cases. You can go a long time trusting the tool and never see the gap between intention and guarantee.
But when the system matters enough that “never got burned” isn’t an acceptable operating standard, you have to step in. You upgrade the good intention to an ironclad promise. You read the actual state. You evaluate it against what you need. You update only what’s wrong. You validate the result. You account for the fact that the world may have moved forward since the last time you checked, and that some of those moves can’t be undone.
That’s not distrusting your tools. That’s knowing where their promises end and where yours have to begin.