April 15, 2026 · 6 min

Network Automation Is Harder Than Server Automation. Do It Anyway.

Patrick McClory

Network automation is harder than server automation because the blast radius of a mistake is immediate and the feedback loop is brutal. That brutality is actually the advantage. It forces you to get the abstractions right in a way that server automation lets you defer.

automation networking platform-engineering


Server automation is forgiving. If your Ansible role has a bug, the server ends up in a slightly wrong state, you fix the role, you run it again. The blast radius is usually contained. The feedback loop is slow enough that you can think.

Network automation is not forgiving. If your Ansible role has a bug, you might lose connectivity to the device you’re configuring. The blast radius is immediate. The feedback loop is brutal.

That brutality is actually the advantage.

Why the Feedback Loop Matters

When automation can fail silently or fail slowly, you can defer getting it right. Server automation accumulates technical debt in ways that are invisible until they aren’t: slight variances in state across nodes, idempotency that works most of the time, edge cases that only surface when something is already wrong.

Network automation doesn’t let you defer. The abstraction boundary has to be right because the hardware will tell you immediately when it isn’t. The sequencing has to be correct because RouterOS applies changes per edit and intermediate states are real states the device passes through. The idempotency has to be genuine because running a half-correct playbook twice doesn’t get you to a correct state. It might get you to a broken state that’s harder to diagnose than where you started.
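One way to make the sequencing point concrete is to treat every intermediate state as something you check, not just the end state. The following is a minimal sketch in Python, not the role's actual implementation: the state shape, the three edits, and the names `mgmt_reachable`, `first_unsafe_step`, and `safe_orderings` are all hypothetical. It simulates moving management from ether1 to a new VLAN interface and reports which orderings keep the device reachable at every step.

```python
import copy
from itertools import permutations
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple

State = Dict[str, Any]
Edit = Tuple[str, Callable[[State], None]]

# Hypothetical starting state: management currently rides on ether1.
START: State = {"enabled": {"ether1": True}, "mgmt_on": "ether1"}


def add_vlan10(state: State) -> None:
    state["enabled"]["vlan10"] = True


def move_mgmt(state: State) -> None:
    state["mgmt_on"] = "vlan10"


def disable_ether1(state: State) -> None:
    state["enabled"]["ether1"] = False


EDITS: List[Edit] = [
    ("add_vlan10", add_vlan10),
    ("move_mgmt", move_mgmt),
    ("disable_ether1", disable_ether1),
]


def mgmt_reachable(state: State) -> bool:
    """Invariant: management must sit on an interface that exists and is enabled."""
    return state["enabled"].get(state["mgmt_on"], False)


def first_unsafe_step(start: State, edits: Sequence[Edit]) -> Optional[str]:
    """Apply edits one at a time; return the first edit that breaks the invariant."""
    state = copy.deepcopy(start)
    for name, edit in edits:
        edit(state)
        if not mgmt_reachable(state):
            return name
    return None


def safe_orderings(start: State, edits: Sequence[Edit]) -> List[List[str]]:
    """Every ordering in which no intermediate state loses management access."""
    return [
        [name for name, _ in order]
        for order in permutations(edits)
        if first_unsafe_step(start, order) is None
    ]
```

All six orderings end in the same final state, but in this toy model only one of them gets there without passing through a state where the device is unreachable. That is the difference between a correct end state and a correct sequence.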

The discipline that network automation demands is the discipline that all automation should have. It’s just that network automation enforces it with consequences that get your attention.

What the Unified Role Actually Does

The roles/network/ role that runs the Quorum Systems network handles both switching and routing through a single unified structure with provider-specific task subdirectories. Arista EOS and MikroTik RouterOS are both running in the same role today: different platforms, different APIs, different data models, handled through the same abstraction layer. The arista.eos collection manages the switching layer. community.routeros manages the routing layer. Same role, same logical design above the seam, different vendor implementations below it.

That’s not a theoretical multi-vendor capability. It’s running. And the pattern extends directly to every major network vendor with an Ansible collection: Cisco IOS, Juniper JunOS, whatever comes next. The abstraction boundary is the same regardless of what’s underneath it. The vendor is a variable. The design is stable. Adding a new vendor means adding a provider task subdirectory, not redesigning the role.

What the role does is put the abstraction boundary in the right place. The logical design (VLANs, segments, routing relationships, firewall policy) lives above the seam. The vendor-specific implementation of that design lives below it. When the physical reality differs from the logical design, you change a variable in the right place and the role handles the rest.
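The shape of that seam can be sketched outside Ansible. The Python below is an illustration under stated assumptions, not the role itself: the `LOGICAL_VLANS` data, the renderer functions, the `bridge1` parent interface, and the output dictionaries are all hypothetical, and the output shapes only loosely resemble what each platform expects. The point is the structure: one vendor-neutral design, one lookup keyed by provider, and vendor detail confined to the renderers.

```python
from typing import Any, Callable, Dict, List

# Above the seam: the logical design, with no vendor vocabulary in it.
LOGICAL_VLANS: List[Dict[str, Any]] = [
    {"id": 10, "name": "mgmt"},
    {"id": 20, "name": "storage"},
]


def render_eos(vlans: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Below the seam: a switch-style VLAN list (illustrative shape)."""
    return [{"vlan_id": v["id"], "name": v["name"]} for v in vlans]


def render_routeros(
    vlans: List[Dict[str, Any]], parent: str = "bridge1"
) -> List[Dict[str, Any]]:
    """Below the seam: router-style VLAN interfaces (illustrative shape)."""
    return [
        {
            "name": f"vlan{v['id']}",
            "vlan-id": str(v["id"]),
            "interface": parent,  # hypothetical parent interface
            "comment": v["name"],
        }
        for v in vlans
    ]


# Adding a vendor means adding one renderer and one entry here.
PROVIDERS: Dict[str, Callable[..., List[Dict[str, Any]]]] = {
    "eos": render_eos,
    "routeros": render_routeros,
}


def render(provider: str, vlans: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Dispatch the same logical design to a vendor-specific renderer."""
    try:
        renderer = PROVIDERS[provider]
    except KeyError:
        raise ValueError(f"no provider implementation for {provider!r}") from None
    return renderer(vlans)
```

Calling `render("eos", LOGICAL_VLANS)` and `render("routeros", LOGICAL_VLANS)` produces two different vendor payloads from the same input, and an unknown provider fails loudly instead of silently doing nothing.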

The api_modify approach for RouterOS is the key decision. api_modify expresses desired state declaratively. You describe what you want. The module computes the diff against current state and applies only what needs to change. Run it twice and it converges. Run it on a device in an unknown state and it brings that device to the desired state regardless of what was there before.

That’s idempotency working correctly. Not “this playbook is safe to run twice if nothing has changed” but “this playbook will produce the correct result regardless of current state.” Those are different properties and the second one is the one that matters in production.
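The difference between those two properties is easy to show in a toy model. This is a minimal sketch of desired-state convergence in Python, not how api_modify is implemented: the `State` shape and the functions `compute_diff`, `apply_ops`, and `converge` are hypothetical names chosen for illustration. It computes only the operations needed to get from whatever is currently there to the desired state, then applies them.

```python
from typing import Dict, List, Tuple

Entry = Dict[str, str]
State = Dict[str, Entry]  # keyed by a stable identifier, e.g. an entry name
Op = Tuple[str, str, Entry]


def compute_diff(current: State, desired: State) -> List[Op]:
    """Return only the operations needed to move current to desired."""
    ops: List[Op] = []
    for key, entry in desired.items():
        if key not in current:
            ops.append(("add", key, entry))
        elif current[key] != entry:
            ops.append(("set", key, entry))
    for key, entry in current.items():
        if key not in desired:
            ops.append(("remove", key, entry))
    return ops


def apply_ops(current: State, ops: List[Op]) -> State:
    """Apply operations to a copy of the current state."""
    result = {key: dict(entry) for key, entry in current.items()}
    for action, key, entry in ops:
        if action == "remove":
            del result[key]
        else:  # "add" and "set" both write the desired entry
            result[key] = dict(entry)
    return result


def converge(current: State, desired: State) -> Tuple[State, List[Op]]:
    """One run: diff against current state, apply, report what changed."""
    ops = compute_diff(current, desired)
    return apply_ops(current, ops), ops
```

Starting from a state with a wrong value and a stale entry, the first `converge` call emits a set, an add, and a remove and lands on the desired state. A second call against the result emits no operations at all. The first run is the property that matters in production; the second run is just the evidence that it held.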

The CAB Meeting Problem

I’ve spent years in corporate environments watching network change control. The CAB meeting. The maintenance window. The detailed change plan with rollback steps. The secondary engineer on standby. The post-change verification checklist.

All of that exists because network changes are high-risk, manual, and hard to reverse. The governance process is the mitigation for the fact that the change itself is inconsistent, not fully traceable, and dependent on the specific engineer making it.

When I ran the unified network role on live gear for the first time and watched it converge cleanly, the thing that struck me was almost humorous: this playbook run is more consistent, more traceable, and more repeatable than any network change I’ve been part of in a corporate change control process. The diff is explicit. The desired state is version-controlled. The execution is identical regardless of who runs it or when. The audit trail is a git log.

The change control process was trying to achieve consistency, traceability, and repeatability through governance. Declarative configuration management delivers those properties through the automation itself. The governance becomes less necessary because the underlying problem it was solving no longer exists in the same form.

That’s not an argument against thoughtful change management. It’s an argument that the right automation makes thoughtful change management much simpler, because the risk profile of the change is different when the change is declarative, version-controlled, and idempotent.

What Made the Abstraction Hard to Get Right

The three-bears problem with network automation is real and documented elsewhere. The short version: the first design had too many abstraction layers and was nearly impossible to operate. The normalization attempt technically worked and was still operationally exhausting. The final design put the abstraction boundary where the hardware actually required it, not where the design looked elegant.

What made it hard was that network automation requires understanding two different things simultaneously: the logical design you’re trying to implement and the specific way each vendor’s platform expresses that design. RouterOS thinks about interface relationships differently than EOS. The data model for routing tables is different. The way firewall rules chain together is different. You can’t fully abstract those differences away without losing the ability to actually configure the devices.

What you can do is contain them. Put the vendor-specific reality in provider task subdirectories where it belongs. Keep the logical design above the seam. Accept that switching and routing are operationally different enough that a unified data model adds more complexity than it removes. And build the automation to reflect the problem as it actually is rather than the problem as you’d prefer it to be.

The playbook that’s slightly inelegant but works under pressure is worth more than the playbook that’s architecturally beautiful and requires expertise to operate.

The Operational Difference Between Works and Trusted

There’s a gap between “this works” and “I trust this.” Server automation can live in that gap for a long time because the consequences of it being wrong are recoverable. Network automation can’t.

The moment the network role felt trusted rather than just working was when I could run it on a device I hadn’t touched recently, in a state I wasn’t certain of, and rely on it to converge correctly without having to think carefully about what might go wrong. That’s a different relationship with the automation than “I’ve tested this and I think it’s right.”

Getting there required the connectivity losses. Required the sequencing work. Required finding the places where my model of the device’s behavior and the device’s actual behavior didn’t match. Each of those failures made the automation more trustworthy because they forced the abstraction boundary to be correct rather than approximately correct.

Approximately correct automation fails in ways that are hard to diagnose. Correct automation fails in ways that are obvious. The goal is obvious failures, not because failures are good, but because obvious failures are recoverable, while the failures of approximately correct automation accumulate into states that aren’t.

Boring network changes. That’s the goal. The discipline it took to get there is what makes them boring.
